Abstract

This paper provides an extensive study of the behavior of the best achievable rate (and other 
related fundamental limits) in variable-length lossless compression. In the non-asymptotic regime, 
the fundamental limits of fixed-to-variable lossless compression with and without prefix constraints 
are shown to be tightly coupled. Several precise, quantitative bounds are derived, connecting the 
distribution of the optimal codelengths to the source information spectrum, and an exact analysis of 
the best achievable rate for arbitrary sources is given. 

Fine asymptotic results are proved for arbitrary (not necessarily prefix) compressors on general 
mixing sources. Non-asymptotic, explicit Gaussian approximation bounds are established for the 
best achievable rate on Markov sources. The source dispersion and the source varentropy rate 
are defined and characterized. Together with the entropy rate, the varentropy rate serves to tightly 
approximate the fundamental non-asymptotic limits of fixed-to-variable compression for all but very 
small blocklengths. 

Keywords: Lossless data compression, fixed-to-variable source coding, fixed-to-fixed source coding,
entropy, finite-blocklength fundamental limits, central limit theorem, Markov sources, varentropy,
minimal coding variance, source dispersion.
I. Fundamental Limits
A. The optimum fixed-to-variable code 

A fixed-to-variable compressor for a finite alphabet $\mathcal{A}$ is an injective function,

$$f_n : \mathcal{A}^n \to \{0, 1\}^* = \{\emptyset, 0, 1, 00, 01, 10, 11, 000, 001, \ldots\}. \tag{1}$$

The length of a string $a \in \{0, 1\}^*$ is denoted by $\ell(a)$. Therefore, a block (or file) of $n$ symbols
$a^n = (a_1, a_2, \ldots, a_n) \in \mathcal{A}^n$ is losslessly compressed by $f_n$ into a binary string whose length
is $\ell(f_n(a^n))$ bits.

When the file $X^n = (X_1, X_2, \ldots, X_n)$ to be compressed is generated by a probability
law $P_{X^n}$, a basic information-theoretic object of study is the distribution of the rate of the
optimal compressor, seen as a function of the blocklength $n$ and the distribution $P_{X^n}$. The best
achievable compression performance at finite blocklengths is characterized by fundamental 
limits, including: 

1) $R^*(n, \epsilon)$: The lowest rate $R$ such that the compression rate of the best code exceeds $R$
with probability not greater than $\epsilon$:

$$\min_{f_n} P[\ell(f_n(X^n)) > nR] \le \epsilon. \tag{2}$$

2) $\epsilon^*(n, k)$: The smallest possible excess-rate probability, namely, the probability that the
compressed length is greater than or equal to $k$:

$$\epsilon^*(n, k) = \min_{f_n} P[\ell(f_n(X^n)) \ge k]. \tag{3}$$

3) $n^*(R, \epsilon)$: The smallest blocklength at which compression at rate $R$ is possible with
probability at least $1 - \epsilon$; in other words, the minimum $n$ required for (2) to hold.

4) $R(n)$: The minimal average compression rate:

$$R(n) = \frac{1}{n} \min_{f_n} E[\ell(f_n(X^n))] \tag{4}$$

$$\phantom{R(n)} = \frac{1}{n} \sum_{k=1}^{\infty} \epsilon^*(n, k). \tag{5}$$

Naturally, the fundamental limits in 1), 2) and 3) are equivalent in the sense that knowledge 
of one of them (as a function of its parameters) determines the other two. For example, 

$$R^*(n, \epsilon) = \frac{k}{n} \quad \text{if and only if} \quad \epsilon^*(n, k) \le \epsilon < \epsilon^*(n, k - 1). \tag{6}$$

As for 4), we observe that, together with (5) and the fact that $\epsilon^*(n, 0) = 1$, (6) results in:

$$R(n) = \int_0^1 R^*(n, x)\, dx - \frac{1}{n}. \tag{7}$$

The minima in the fundamental limits (2), (3), are achieved by an optimal compressor $f_n^*$
that assigns the elements of $\mathcal{A}^n$ ordered in decreasing probabilities to the elements in $\{0, 1\}^*$
ordered lexicographically as in (1). In particular,

$$R(n) = \frac{1}{n} E[\ell(f_n^*(X^n))], \tag{8}$$
and,

$$R^*(n, \epsilon) \text{ is the smallest } R \text{ s.t. } P[\ell(f_n^*(X^n)) > nR] \le \epsilon, \tag{9}$$

where the optimal compressor $f_n^*$ is described precisely as:

Property 1: For every $k = 1, \ldots, \lfloor \log_2(1 + |\mathcal{A}|^n) \rfloor$, any optimal code $f_n^*$ assigns strings of
length $0, 1, 2, \ldots, k - 1$ to each of the

$$1 + 2 + 4 + \cdots + 2^{k-1} = 2^k - 1, \tag{10}$$

most likely elements of $\mathcal{A}^n$. If $\log_2(1 + |\mathcal{A}|^n)$ is not an integer, then $f_n^*$ assigns strings of
length $\lfloor \log_2(1 + |\mathcal{A}|^n) \rfloor$ to the least likely $|\mathcal{A}|^n + 1 - 2^{\lfloor \log_2(1 + |\mathcal{A}|^n) \rfloor}$ elements in $\mathcal{A}^n$.

Note that Property 1 is a necessary and sufficient condition for optimality, which does not
determine $f_n^*$ uniquely: not only does it not specify how to break ties among probabilities but
any swap between two codewords of the same length preserves optimality. As in the following
example, it is convenient, however, to think of $f_n^*$ as the unique compressor constructed by
breaking ties lexicographically and by assigning the elements of $\{0, 1\}^*$ in the lexicographic
order of (1).
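The construction in Property 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `binary_strings` and `optimal_code` and the five-outcome toy distribution are hypothetical and do not appear in the paper, and ties are broken by sorting the outcome labels.

```python
from itertools import count, product

def binary_strings():
    """Yield '', '0', '1', '00', '01', ...: shortest first, then lexicographic."""
    for length in count(0):
        for bits in product("01", repeat=length):
            yield "".join(bits)

def optimal_code(pmf):
    """Assign the ordered binary strings to outcomes ranked by decreasing probability.

    pmf: dict mapping outcome -> probability. Ties are broken by the outcome
    labels themselves (an arbitrary, illustrative choice).
    """
    ranked = sorted(pmf, key=lambda x: (-pmf[x], x))
    return dict(zip(ranked, binary_strings()))

# Hypothetical toy source with five outcomes.
pmf = {"a": 0.4, "b": 0.3, "c": 0.15, "d": 0.1, "e": 0.05}
code = optimal_code(pmf)
lengths = [len(code[x]) for x in "abcde"]
```

In this toy case the most likely outcome receives the empty string, the next two receive the strings of length 1, and the last two receive strings of length 2, in line with the count $2^k - 1$ in (10).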

Example 1: Suppose $n = 4$, $\mathcal{A} = \{\circ, \bullet\}$, and the source is memoryless with $P[X = \bullet] >
P[X = \circ]$. Then the following compressor is optimal:
We emphasize that the optimum code $f_n^*$ is independent of the design target, in that, e.g.,
it is the same regardless of whether we want to minimize average length or the probability
that the encoded length exceeds 1 KB or 1 MB. In fact, the code $f_n^*$ possesses the following
strong stochastic (competitive) optimality property over any other code $f_n$ that can be losslessly
decoded:

$$P[\ell(f_n(X^n)) \ge k] \ge P[\ell(f_n^*(X^n)) \ge k], \quad \text{for all } k \ge 0. \tag{11}$$

Note that, although $f_n^*$ is not a prefix code, the decompressor is able to recover the source
file $a^n$ exactly from $f_n^*(a^n)$ and its knowledge of $n$ and $P_{X^n}$. Since the whole source file is
compressed, it is not necessary to impose a prefix condition in order for the decompressor to
know where the compressed file starts and ends. Removing the prefix-free constraint at the
block level, which is extraneous in most applications, results in higher compression efficiency.



B. Optimum fixed-to-variable prefix codes 

The fixed-to-variable prefix code that minimizes the average length is the Huffman code,
achieving the average compression rate $R_p(n)$ (which is strictly larger than $R(n)$), defined
as in (4) but restricting the minimization to prefix codes. Alternatively, as in (2), we can
investigate the optimum rate of the prefix code that minimizes the probability that the length
exceeds a given threshold. If the minimization in (2) is carried out with respect to codes that
satisfy the prefix condition then the corresponding fundamental limit is denoted by $R_p(n, \epsilon)$,
and analogously $\epsilon_p(n, k)$ for (3). Note that the optimum prefix code achieving the minimum
in (3) will, in general, depend on $k$. The following result shows that the corresponding
fundamental limits, with and without the prefix condition, are tightly coupled:

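Theorem 1: Suppose all elements in $\mathcal{A}$ have positive probability. For all $n = 1, 2, \ldots$

1) For each $k = 1, 2, \ldots$:

$$\epsilon_p(n, k + 1) = \begin{cases} \epsilon^*(n, k) & k < n \log_2 |\mathcal{A}| \\ 0 & k \ge n \log_2 |\mathcal{A}|. \end{cases} \tag{12}$$

2) If $|\mathcal{A}|$ is not a power of 2, then for $0 \le \epsilon < 1$:

$$R_p(n, \epsilon) = R^*(n, \epsilon) + \frac{1}{n}. \tag{13}$$

If $|\mathcal{A}|$ is a power of 2, then (13) holds for $\epsilon \ge \min_{a^n \in \mathcal{A}^n} P_{X^n}(a^n)$, while we have,

$$R_p(n, \epsilon) = R^*(n, \epsilon) = \log_2 |\mathcal{A}| + \frac{1}{n}, \tag{14}$$

for $0 \le \epsilon < \min_{a^n \in \mathcal{A}^n} P_{X^n}(a^n)$.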



Proof: 1): Fix $k$ and $n$ satisfying $2^k < |\mathcal{A}|^n$. Since there is no benefit in assigning
shorter lengths, any Kraft-inequality-compliant code $f_n^p$ that minimizes $P[\ell(f_n(X^n)) > k]$
assigns length $k$ to each of the $2^k - 1$ largest masses of $P_{X^n}$. Assigning all the other elements
in $\mathcal{A}^n$ lengths equal to

$$\ell_{\max} = \lceil k + \log_2(|\mathcal{A}|^n - 2^k + 1) \rceil, \tag{15}$$

guarantees that the Kraft sum is satisfied. On the other hand, according to Property 1, the
optimum code $f_n^*$ without prefix constraints encodes each of the $2^k - 1$ largest masses of $P_{X^n}$
with lengths ranging from 0 to $k - 1$. Therefore,

$$P[\ell(f_n^p(X^n)) \ge k + 1] = P[\ell(f_n^*(X^n)) \ge k]. \tag{16}$$
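Alternatively, if $2^k \ge |\mathcal{A}|^n$, then a zero-error $n$-to-$k$ code exists, and therefore $\epsilon_p(n, k + 1) = 0$.
2): According to $f_n^*$, the length of the longest codeword is $\lfloor n \log_2 |\mathcal{A}| \rfloor$. Therefore,

$$\epsilon^*(n, \lfloor n \log_2 |\mathcal{A}| \rfloor + 1) = 0 \tag{17}$$

and

$$\epsilon^*(n, \lfloor n \log_2 |\mathcal{A}| \rfloor) \begin{cases} > \min_{a^n \in \mathcal{A}^n} P_{X^n}(a^n) & |\mathcal{A}| \text{ not a power of 2} \\ = \min_{a^n \in \mathcal{A}^n} P_{X^n}(a^n) & |\mathcal{A}| \text{ is a power of 2.} \end{cases} \tag{18}$$

On the other hand, 1) implies

$$\epsilon_p(n, \lceil n \log_2 |\mathcal{A}| \rceil + 1) = 0 \tag{19}$$

$$\epsilon_p(n, \lceil n \log_2 |\mathcal{A}| \rceil) = \epsilon^*(n, \lceil n \log_2 |\mathcal{A}| \rceil - 1) \tag{20}$$

$$\phantom{\epsilon_p(n, \lceil n \log_2 |\mathcal{A}| \rceil)} \ge \min_{a^n \in \mathcal{A}^n} P_{X^n}(a^n). \tag{21}$$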

Furthermore, $R_p(n, \cdot)$ can be obtained from $\epsilon_p(n, \cdot)$ through the counterpart to (6):

$$R_p(n, \epsilon) = \frac{i}{n} \quad \text{if and only if} \quad \epsilon_p(n, i) \le \epsilon < \epsilon_p(n, i - 1). \tag{22}$$



Together with (6) and (12), (22) implies that (13) holds if $\epsilon \ge \min_{a^n \in \mathcal{A}^n} P_{X^n}(a^n)$. Otherwise,
(18)-(21) result in (14) when $|\mathcal{A}|$ is a power of 2. If $|\mathcal{A}|$ is not a power of 2 and
$0 \le \epsilon < \min_{a^n \in \mathcal{A}^n} P_{X^n}(a^n)$, then

$$R_p(n, \epsilon) = \frac{\lceil n \log_2 |\mathcal{A}| \rceil + 1}{n}. \tag{23}$$
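The Kraft-sum claim behind (15) can be checked numerically. The sketch below is an illustration only: the helper names `prefix_lengths` and `kraft_sum` and the parameters (a ternary alphabet, $n = 4$, $k = 5$) are assumptions chosen for the example, not values taken from the paper.

```python
from fractions import Fraction
from math import ceil, log2

def prefix_lengths(alphabet_size, n, k):
    """Lengths from the proof: 2^k - 1 codewords of length k, all others of length l_max."""
    total = alphabet_size ** n
    assert 2 ** k < total, "nontrivial regime 2^k < |A|^n"
    rest = total - (2 ** k - 1)              # |A|^n - 2^k + 1 remaining blocks
    l_max = ceil(k + log2(rest))             # the length in (15)
    return [k] * (2 ** k - 1) + [l_max] * rest

def kraft_sum(lengths):
    """Exact Kraft sum: sum of 2^{-length} over all codewords."""
    return sum(Fraction(1, 2 ** l) for l in lengths)

# Illustrative parameters: 3^4 = 81 blocks, 31 of them get length 5.
lengths = prefix_lengths(alphabet_size=3, n=4, k=5)
total_kraft = kraft_sum(lengths)
```

A Kraft sum of at most 1 means that a prefix code with these lengths exists, which is the step the proof relies on.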



C. The optimum fixed-to-fixed almost-lossless code

As pointed out in [30], [31], the quantity $\epsilon^*(n, k)$ is, in fact, intimately related to the problem
of almost-lossless fixed-to-fixed data compression. Assume the nontrivial compression regime
in which $2^k < |\mathcal{A}|^n$. The optimal $n$-to-$k$ fixed-to-fixed compressor assigns a unique string
of length $k$ to each of the $2^k - 1$ most likely elements of $\mathcal{A}^n$, and assigns all the others to
the remaining binary string of length $k$, which signals an encoder failure. Thus, the source
strings that are decodable error-free by the optimal $n$-to-$k$ scheme are precisely those that are
encoded with lengths ranging from 0 to $k - 1$ by the optimum variable-length code (Property 1).
Therefore, $\epsilon^*(n, k)$, defined in (3) as a fundamental limit of (strictly) lossless variable-length
codes is, in fact, equal to the minimum error probability of an $n$-to-$k$ code. Accordingly,
the results obtained in this paper apply to the standard paradigm of almost-lossless
fixed-to-fixed compression as well as to the setup of lossless fixed-to-variable compression without
prefix-free constraints at the block level.
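The correspondence described above can be illustrated on a small memoryless source. In this sketch the symbol probabilities (0.8 and 0.2), the blocklength $n = 4$ and the function names are assumptions made only for the example.

```python
from itertools import product
from math import prod

def block_probs(p_symbol, n):
    """Probabilities of all length-n blocks of a memoryless source, sorted descending."""
    probs = [prod(p_symbol[s] for s in block) for block in product(p_symbol, repeat=n)]
    return sorted(probs, reverse=True)

def eps_star(sorted_probs, k):
    """P[l(f*) >= k]: total mass outside the 2^k - 1 most likely blocks."""
    return sum(sorted_probs[2 ** k - 1:])

def fixed_to_fixed_error(sorted_probs, k):
    """Error probability of the n-to-k code that gives the 2^k - 1 most likely
    blocks their own k-bit string and maps every other block to a failure string."""
    return 1.0 - sum(sorted_probs[:2 ** k - 1])

# Illustrative binary source; 2^k < 2^4 holds for k = 0, 1, 2, 3.
probs = block_probs({"0": 0.8, "1": 0.2}, n=4)
pairs = [(eps_star(probs, k), fixed_to_fixed_error(probs, k)) for k in range(4)]
```

Up to floating-point rounding, the two entries of each pair coincide, which is the equality between $\epsilon^*(n, k)$ and the minimum $n$-to-$k$ error probability.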

The case $2^k \ge |\mathcal{A}|^n$ is rather trivial: the minimal probability of encoding failure for an
$n$-to-$k$ code is 0, which again coincides with $\epsilon^*(n, k)$, unless $|\mathcal{A}|^n = 2^k$, in which case, as
we saw in (18),

$$\epsilon^*(n, k) = \min_{a^n \in \mathcal{A}^n} P_{X^n}(a^n). \tag{25}$$
D. Existing asymptotic results

Based on the correspondence between almost-lossless fixed-to-fixed codes and prefix-free
lossless fixed-to-variable codes, previous results on the asymptotics of fixed-to-fixed
compression can be brought to bear. In particular the Shannon-MacMillan theorem [|24|. ifTTl 
implies that for a stationary ergodic finite-alphabet source with entropy rate $H$, and for all
$0 < \epsilon < 1$,

$$\lim_{n \to \infty} R^*(n, \epsilon) = H. \tag{26}$$

It follows immediately from Theorem 1 that the prefix-free condition incurs no loss as far as
the limit in (26) is concerned:

$$\lim_{n \to \infty} R_p(n, \epsilon) = H. \tag{27}$$

Suppose $X^n$ is generated by a memoryless source with distribution

$$P_{X^n} = P_X \times P_X \times \cdots \times P_X, \tag{28}$$

and define the information random variable¹

$$\imath_X(X) = \log_2 \frac{1}{P_X(X)}. \tag{29}$$

For the expected length, Szpankowski and Verdú [27] show that the behavior of (4) for
non-equiprobable sources is,

$$R(n) = H - \frac{1}{2n} \log_2 n + O\left(\frac{1}{n}\right), \tag{30}$$

which is also refined to show that, if $\imath_X(X)$ is non-lattice,² then,

$$R(n) = H - \frac{1}{2n} \log_2(8 \pi e \sigma^2 n) + o\left(\frac{1}{n}\right), \tag{31}$$

where,

$$\sigma^2 = \mathrm{Var}(\imath_X(X)), \tag{32}$$

is the varentropy or minimal coding variance [11] of $P_X$. In contrast, when a prefix-free
condition is imposed, we have the well-known behavior (see, e.g., [0), 

$$R_p(n) = H + O\left(\frac{1}{n}\right), \tag{33}$$

for any source for which $H = \lim_{n \to \infty} \frac{1}{n} H(X^n)$ exists.

1 A legacy of the Kraft inequality mindset, the term "ideal codelength" is sometimes used for $\imath_X(X)$. This is inappropriate
in view of the fact that the optimum codelengths are in fact bounded above by $\imath_X(X)$; see Section II. Therefore, these
"ideal codelengths" are neither ideal nor are they actual codelengths.

2 A discrete random variable is lattice if all its masses are on a subset of some lattice $\{\nu + k\varsigma ;\ k \in \mathbb{Z}\}$.
For a non-equiprobable source such that $\imath_X(X)$ is non-lattice, Strassen [26] claims³ the
following Gaussian approximation result as a refinement of (26):

$$R^*(n, \epsilon) = H + \frac{\sigma}{\sqrt{n}} Q^{-1}(\epsilon) - \frac{1}{2n} \log_2\left(2 \pi \sigma^2 n\, e^{(Q^{-1}(\epsilon))^2}\right) + \frac{\mu_3}{6 \sigma^2 n}\left((Q^{-1}(\epsilon))^2 - 1\right) + o\left(\frac{1}{n}\right). \tag{34}$$

Here, $Q(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-t^2/2}\, dt$ denotes the standard Gaussian tail function, $\sigma^2$ is the
varentropy of $P_X$ defined in (32), and $\mu_3$ is the third centered absolute moment of the information
random variable (29).

Kontoyiannis [11] gives a different kind of Gaussian approximation for the codelengths
$\ell(f_n(X^n))$ of arbitrary prefix codes $f_n$ on memoryless data $X^n$, showing that, with probability
one, $\ell(f_n(X^n))$ is eventually lower bounded by a random variable that has an approximately
Gaussian distribution,

$$\ell(f_n(X^n)) \ge Z_n \quad \text{where} \quad Z_n \approx N(nH, n\sigma^2) \text{ in distribution}, \tag{35}$$

and $\sigma^2$ is the varentropy as in (32). Therefore, the codelengths $\ell(f_n(X^n))$ will have at least
Gaussian fluctuations of $O(\sqrt{n})$; this is further sharpened in [11] to a corresponding law of
the iterated logarithm, stating that, with probability one, the compressed lengths $\ell(f_n(X^n))$
will have fluctuations of $O(\sqrt{n \ln \ln n})$, infinitely often: with probability one,

$$\limsup_{n \to \infty} \frac{\ell(f_n(X^n)) - H(X^n)}{\sqrt{2 n \ln \ln n}} \ge \sigma. \tag{36}$$

Both results (35) and (36) are shown to hold for Markov sources as well as for a wide class
of mixing sources with infinite memory.



E. Outline of main new results 

Section II gives a general analysis of the distribution of the lengths of an optimal lossless
code for any discrete information source, which may or may not produce fixed-length strings
of symbols. First, in Theorems 2 and 3 we give simple achievability and converse bounds,
showing that the distribution function of the optimal codelengths, $P[\ell(f^*(X)) \le t]$, is
intimately related to the distribution of the information random variable, $P[\imath_X(X) \le t]$. Also we
observe that the optimal codelengths $\ell(f^*(X))$ are always bounded above by $\imath_X(X)$, but
Theorem 4 states that they cannot be significantly smaller than $\imath_X(X)$ with high probability. The
corresponding result for prefix codes, originally derived in [2], [11], is stated in Theorem 5.

Theorem 6 offers an exact, non-asymptotic expression for the best achievable rate $R^*(n, \epsilon)$. So
far, no other problem in information theory has yielded an exact non-asymptotic formula for
the fundamental limit. An exact expression for the average probability of error achieved by
(almost-lossless) random binning is given in Theorem 7.

General non-asymptotic and asymptotic results for the expected optimal length, $R(n) =
(1/n) E[\ell(f_n^*(X^n))]$, are obtained in Section III. Attained by the Huffman code, the minimal
average length of prefix codes is unknown. However, dropping the extraneous prefix constraint
for non-symbol-by-symbol codes results in an explicit formula for the minimal average length.

3 See the discussion in Section V regarding Strassen's claimed proof of this result.

In Section IV we revisit the refined asymptotic results (35) and (36) of [11], and show that
they remain valid for general (not necessarily prefix) compressors, and for a broad class of
possibly infinite-memory sources.

Section V examines in detail the finite-blocklength behavior of the fundamental limit
$R^*(n, \epsilon)$ for the case of memoryless sources. We prove tight, non-asymptotic and easily
computable bounds for $R^*(n, \epsilon)$; specifically, combining the results of Theorems 16 and 17
implies the following approximation for finite blocklengths $n$:

Gaussian approximation I: For every memoryless source, the best achievable rate
$R^*(n, \epsilon)$ satisfies:

$$n R^*(n, \epsilon) \approx nH + \sigma \sqrt{n}\, Q^{-1}(\epsilon) - \frac{1}{2} \log_2 n, \tag{37}$$

where the approximation is accurate up to $O(1)$ terms; the same holds for $R_p(n, \epsilon)$
in the case of prefix codes.

The approximation (37) is established by combining the general results of Section II with
the classical Berry-Esseen bound [15], [21]. This approximation is made precise in a
non-asymptotic way, and all the constants involved are explicitly identified.
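To give a feel for the size of the terms in (37), the right side can be evaluated numerically. The following is a sketch under assumptions: the function names and the Bernoulli(0.11) source with $n = 1000$ and $\epsilon = 0.1$ are illustrative choices, and the value computed is only the approximation itself, not the exact $R^*(n, \epsilon)$.

```python
from math import log2, sqrt
from statistics import NormalDist

def entropy_and_varentropy(pmf):
    """Entropy H and varentropy sigma^2 (both in bits) of a single-symbol pmf."""
    H = -sum(p * log2(p) for p in pmf if p > 0)
    var = sum(p * (-log2(p) - H) ** 2 for p in pmf if p > 0)
    return H, var

def gaussian_approx_bits(pmf, n, eps):
    """Right side of (37): n*H + sigma*sqrt(n)*Qinv(eps) - 0.5*log2(n)."""
    H, var = entropy_and_varentropy(pmf)
    q_inv = NormalDist().inv_cdf(1.0 - eps)   # Q^{-1}(eps) for the Gaussian tail Q
    return n * H + sqrt(var * n) * q_inv - 0.5 * log2(n)

# Illustrative source: biased coin with P[1] = 0.11, so H is close to 0.5 bits.
H, var = entropy_and_varentropy([0.11, 0.89])
bits = gaussian_approx_bits([0.11, 0.89], n=1000, eps=0.1)
rate = bits / 1000
```

For $\epsilon < 1/2$ the term $\sigma \sqrt{n}\, Q^{-1}(\epsilon)$ is positive and, in this example, outweighs the $\frac{1}{2} \log_2 n$ correction, so the approximate rate sits above the entropy.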



In Section VI achievability and converse bounds (Theorems 18 and 19) are established for
$R^*(n, \epsilon)$, in the case of general ergodic Markov sources. Those results are analogous (though
slightly weaker) to those in Section V.

We also define the varentropy rate of an arbitrary source as the limiting normalized variance
of the information random variables $\imath_{X^n}(X^n)$, and we show that, for Markov chains, it plays
the same role as the varentropy defined in (32) for memoryless sources. Those results in
particular imply the following:

Gaussian approximation II: For any ergodic Markov source with entropy rate $H$
and varentropy rate $\sigma^2$, the blocklength $n^*((1 + \eta)H, \epsilon)$ required for the compression rate
to exceed $(1 + \eta)H$ with probability no greater than $\epsilon > 0$, satisfies,

$$n^*((1 + \eta)H, \epsilon) \approx \frac{\sigma^2}{H^2} \left(\frac{Q^{-1}(\epsilon)}{\eta}\right)^2. \tag{38}$$

[See Section IV for the general definition of the varentropy rate $\sigma^2$, and the
discussion in Section VI for details.]



Finally, Section VII defines the source dispersion $D$ as the limiting normalized variance
of the optimal codelengths. In effect, the dispersion gauges the time one must wait for the
source realization to become typical within a given probability, as in (38) above, with $D$ in
place of $\sigma^2$. For a large class of sources (including ergodic Markov chains of any order), the
dispersion $D$ is shown to equal the varentropy rate $\sigma^2$ of the source.
II. Non-asymptotic Bounds for Arbitrary Sources

In this section we analyze the best achievable compression performance on a completely
general discrete random source. In particular, (except where noted) we do not necessarily
assume that the alphabet is finite and we do not exploit the fact that in the original problem
we are interested in compressing a block of $n$ symbols. In this way we even encompass
the case where the source string length is a priori unknown at the decompressor. Thus, we
consider a given probability mass function $P_X$ defined on an arbitrary finite alphabet $\mathcal{X}$, which
may (but is not assumed to) consist of variable-length strings drawn from some alphabet. The
results can then be particularized to the setting in Section I, letting $\mathcal{X} \leftarrow \mathcal{A}^n$ and $P_X \leftarrow P_{X^n}$.
Conversely, we can simply let $n = 1$ in Section I to yield the setting in this section.

The best achievable rate $R^*(n, \epsilon)$ at blocklength $n = 1$ is abbreviated as $R^*(\epsilon) = R^*(1, \epsilon)$.
By definition, $R^*(\epsilon)$ is the lowest $R$ such that,

$$P[\ell(f^*(X)) > R] \le \epsilon, \tag{39}$$

which is equal to the quantile function⁴ of the integer-valued random variable $\ell(f^*(X))$
evaluated at $1 - \epsilon$.



A. Achievability bound 

Recall the definition of the information random variable $\imath_X(X)$ in (29). Our goal is to
express the distribution of the optimal codelengths $\ell(f^*(X))$ in terms of the distribution of
$\imath_X(X)$. The first such result is the following simple and powerful upper bound (e.g. [30]) on
the tail of the distribution of the minimum rate.

Theorem 2: For any $a \ge 0$,

$$P[\ell(f^*(X)) \ge a] \le P[\imath_X(X) \ge a]. \tag{40}$$

Proof: Since the labeling of the values taken by the random variable $X$ is immaterial,
it simplifies notation in the proofs to assume that the elements of $\mathcal{X}$ are integer-valued with
decreasing probabilities: $P_X(1) \ge P_X(2) \ge \cdots$. Then, for all $i = 1, 2, \ldots$ we have the
fundamental relationships:

$$\ell(f^*(i)) = \lfloor \log_2 i \rfloor \tag{41}$$

$$P_X(i) \le \frac{1}{i}. \tag{42}$$

Therefore,

$$P[\ell(f^*(X)) \ge a] = P[\lfloor \log_2 X \rfloor \ge a] \tag{43}$$

$$\le P[\log_2 X \ge a] \tag{44}$$

$$\le P[\imath_X(X) \ge a], \tag{45}$$

where (45) follows from (42). ■



4 The quantile function $Q: [0, 1] \to \mathbb{R}$ is the "inverse" of the cumulative distribution function $F$. Specifically, $Q(\alpha) =
\min\{x : F(x) = \alpha\}$ if the set is nonempty; otherwise $\alpha$ lies within a jump $\lim_{x \uparrow x_\alpha} F(x) < \alpha < F(x_\alpha)$ and we define
$Q(\alpha) = x_\alpha$.
Before moving on, we point out that at the core of the above proof is a simple but crucial
observation: not only does the distribution function of the optimal codelengths $\ell(f^*(X))$
dominate that of $\imath_X(X)$, but we in fact always have,

$$\ell(f^*(x)) \le \imath_X(x), \quad \text{for all } x \in \mathcal{X}. \tag{46}$$

This will be used repeatedly throughout the rest of the paper. Also, a simple inspection
of the proof shows that Theorem 2 as well as (46) remain valid even in the case of sources
$X$ with a countably infinite alphabet.
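The pointwise relation (46) is easy to check on a concrete distribution. The sketch below assumes a hypothetical six-outcome distribution and a helper name `length_vs_information` introduced only for illustration; it uses (41) for the optimal codelength of the $i$-th most likely outcome.

```python
from math import log2

def length_vs_information(pmf):
    """Pairs (optimal codelength, information) for outcomes ranked by decreasing probability."""
    ranked = sorted(pmf, reverse=True)
    pairs = []
    for i, p in enumerate(ranked, start=1):
        length = i.bit_length() - 1      # floor(log2 i), the codelength in (41)
        info = -log2(p)                  # information of the i-th most likely outcome
        pairs.append((length, info))
    return pairs

# Hypothetical distribution used only for illustration.
pairs = length_vs_information([0.35, 0.25, 0.2, 0.1, 0.06, 0.04])
holds = all(length <= info for length, info in pairs)
```

The inequality holds for every outcome because the $i$-th largest mass cannot exceed $1/i$, which is exactly (42).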

Theorem 2 is the starting point for the achievability result for $R^*(n, \epsilon)$ established for
Markov sources in Theorem 18.



B. Converse bounds 

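In Theorem 3 we give a corresponding converse result; cf. [30]. It will be used later to obtain
sharp converse bounds for $R^*(n, \epsilon)$ for memoryless and Markov sources, in Theorems 17
and 19, respectively.

Theorem 3: For any nonnegative integer $k$,

$$\max_{\tau > 0} \left\{ P[\imath_X(X) \ge k + \tau] - 2^{-\tau} \right\} \le P[\ell(f^*(X)) \ge k]. \tag{47}$$

Proof: As in the proof of Theorem 2, we label the values taken by $X$ as the positive
integers in decreasing probabilities. Fix an arbitrary $\tau > 0$. Define:

$$\mathcal{L} = \{ i \in \mathcal{X} : P_X(i) \le 2^{-k-\tau} \} \tag{48}$$

$$\mathcal{C} = \{ 1, 2, \ldots, 2^k - 1 \}. \tag{49}$$

Then, abbreviating $P_X(\mathcal{B}) = P[X \in \mathcal{B}] = \sum_{i \in \mathcal{B}} P_X(i)$, for any $\mathcal{B} \subset \mathcal{X}$,

$$P[\imath_X(X) \ge k + \tau] = P_X(\mathcal{L}) \tag{50}$$

$$= P_X(\mathcal{L} \cap \mathcal{C}) + P_X(\mathcal{L} \cap \mathcal{C}^c) \tag{51}$$

$$\le P_X(\mathcal{L} \cap \mathcal{C}) + P_X(\mathcal{C}^c) \tag{52}$$

$$\le (2^k - 1)\, 2^{-k-\tau} + P_X(\mathcal{C}^c) \tag{53}$$

$$< 2^{-\tau} + P[\lfloor \log_2 X \rfloor \ge k] \tag{54}$$

$$= 2^{-\tau} + P[\ell(f^*(X)) \ge k], \tag{55}$$

where (55) follows in view of (41). ■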
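As a sanity check, the converse bound (47) can be evaluated on a small example. In the sketch below the nine-point distribution, the grid of $\tau$ values and the function names are assumptions made for illustration; the tail of the optimal codelength uses $P[\ell(f^*(X)) \ge k] = P[X \ge 2^k]$ for outcomes ranked by decreasing probability.

```python
from math import log2

def info_tail(ranked, t):
    """P[i_X(X) >= t] for a pmf given as a list of probabilities."""
    return sum(p for p in ranked if -log2(p) >= t)

def length_tail(ranked, k):
    """P[l(f*(X)) >= k] = P[X >= 2^k] for outcomes ranked by decreasing probability."""
    return sum(ranked[2 ** k - 1:])

def converse_gap(pmf, k, tau):
    """Right side minus left side of (47) for one tau; nonnegative when the bound holds."""
    ranked = sorted(pmf, reverse=True)
    return length_tail(ranked, k) - (info_tail(ranked, k + tau) - 2.0 ** -tau)

# Hypothetical distribution and parameter grid.
pmf = [0.3, 0.2, 0.15, 0.1, 0.08, 0.07, 0.05, 0.03, 0.02]
gaps = [converse_gap(pmf, k, tau) for k in range(4) for tau in (0.25, 0.5, 1.0, 2.0, 4.0)]
```

Every gap is nonnegative, as Theorem 3 requires; how small the gaps are indicates how tight the bound is for the chosen $k$ and $\tau$.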



Next we give another general converse bound, similar to that of Theorem 3, where this time
we directly compare the codelengths $\ell(f(X))$ of an arbitrary compressor with the values of
the information random variable $\imath_X(X)$. Whereas from (46) we know that $\ell(f^*(X))$ is never
larger than $\imath_X(X)$, Theorem 4 says that it cannot be much smaller with high probability.
This is a natural analog of the corresponding converse established for prefix compressors in
[2], and stated as Theorem 5 below.

Applied to a finite-alphabet source, Theorem 4 is the key bound in the derivation of
all the pointwise asymptotic results of Section IV, Theorems 11, 12 and 13. It is also the
main technical ingredient of the proof of Theorem 22 in Section VII stating that the source
dispersion is equal to its varentropy.
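Theorem 4: For any compressor $f$ and any $\tau > 0$,

$$P[\ell(f(X)) \le \imath_X(X) - \tau] \le 2^{-\tau} \left( \lfloor \log_2 |\mathcal{X}| \rfloor + 1 \right). \tag{56}$$

Proof: Letting $1\{A\}$ denote the indicator function of the event $A$, the probability in (56)
can be bounded by

$$P[\ell(f(X)) \le \imath_X(X) - \tau] = \sum_{x \in \mathcal{X}} P_X(x)\, 1\{ P_X(x) \le 2^{-\tau - \ell(f(x))} \} \tag{57}$$

$$\le 2^{-\tau} \sum_{x \in \mathcal{X}} 2^{-\ell(f(x))} \tag{58}$$

$$\le 2^{-\tau} \sum_{j=0}^{\lfloor \log_2 |\mathcal{X}| \rfloor} 2^j\, 2^{-j}, \tag{59}$$

where the sum in (58) is maximized if $f$ assigns a string of length $j + 1$ only if it also assigns all
strings of length $j$. Therefore, (59) holds because that code contains all the strings of lengths
$0, 1, \ldots, \lfloor \log_2 |\mathcal{X}| \rfloor - 1$ plus $|\mathcal{X}| - 2^{\lfloor \log_2 |\mathcal{X}| \rfloor} + 1 \le 2^{\lfloor \log_2 |\mathcal{X}| \rfloor}$ strings of length $\lfloor \log_2 |\mathcal{X}| \rfloor$. ■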
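The counting step behind (59) can be verified exactly: when a code uses the shortest available binary strings, the sum of $2^{-\text{length}}$ over its codewords never exceeds $\lfloor \log_2 |\mathcal{X}| \rfloor + 1$. The helper name `kraft_like_sum` and the range of alphabet sizes below are illustrative assumptions.

```python
from fractions import Fraction

def kraft_like_sum(size):
    """Sum of 2^{-length} over the `size` shortest binary strings (empty string included).

    The i-th shortest string has length floor(log2 i) = i.bit_length() - 1.
    """
    return sum(Fraction(1, 2 ** (i.bit_length() - 1)) for i in range(1, size + 1))

# size.bit_length() equals floor(log2 size) + 1, the bound used in (56) and (59).
bound_holds = all(kraft_like_sum(size) <= size.bit_length() for size in range(1, 301))
```

Equality is attained when the alphabet size is of the form $2^{m+1} - 1$, that is, when every string of length at most $m$ is used.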
We saw in Theorem 1 that the optimum prefix code under the criterion of minimum excess
length probability incurs a penalty of at most one bit. The following elementary converse is
derived in [2], [11]; its short proof is included for completeness. Indeed, the statements and
proofs of Theorems 4 and 5 are close parallels.

Theorem 5: For any prefix code $f$, and any $\tau > 0$:

$$P[\ell(f(X)) \le \imath_X(X) - \tau] \le 2^{-\tau}. \tag{60}$$

Proof: We have, as in the proof of Theorem 4 leading to (58),

$$P[\ell(f(X)) \le \imath_X(X) - \tau] \le 2^{-\tau} \sum_{x \in \mathcal{X}} 2^{-\ell(f(x))} \tag{61}$$

$$\le 2^{-\tau}, \tag{62}$$

where (62) is Kraft's inequality. ■



C. Exact fundamental limit 

The following result expresses the non-asymptotic data compression fundamental limit
$R^*(\epsilon) = R^*(1, \epsilon)$ as a function of the source information spectrum.

Theorem 6: For all $a \ge 0$, the exact minimum rate compatible with given excess-length
probability satisfies,

$$R^*(\epsilon) = \lceil \log_2(1 + M(2^a)) \rceil - 1, \tag{63}$$

with,

$$\epsilon = P[\imath_X(X) \ge a], \tag{64}$$

where $M(\beta)$ denotes the number of masses with probability strictly larger than $\frac{1}{\beta}$, and which
can be expressed as:

$$M(\beta) = \beta\, P[\imath_X(X) < \log_2 \beta] - \int_1^{\beta} P[\imath_X(X) \le \log_2 t]\, dt. \tag{65}$$

Proof: As above, the values taken by $X$ are labeled as the positive integers in order of
decreasing probability. By the definition of $M(\cdot)$, for any positive integer $i$, and $a > 0$,

$$P_X(i) \le 2^{-a} \iff \log_2(1 + M(2^a)) \le \log_2 i, \tag{66}$$

and it is easy to check that: 

\a] - 1 < [log 2 ij <=^ a < log 2 %. (67) 

Therefore, letting $\alpha = \log_2(1 + M(2^a))$ and letting the integer-valued $X$ take the role of
$i$, we obtain that (39) is satisfied with equality if $R$ is given by the right side of (63). Any
smaller value of $R$ would prevent (39) from being satisfied.



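The proof of (65) follows a sequence of elementary steps:

$$M(\beta) = E\left[ \frac{1\{ P_X(X) > \frac{1}{\beta} \}}{P_X(X)} \right] \tag{68}$$

$$= \int_0^{\infty} P\left[ \frac{1\{ P_X(X) > \frac{1}{\beta} \}}{P_X(X)} > t \right] dt \tag{69}$$

$$= \int_0^{\beta} P\left[ \frac{1}{\beta} < P_X(X) < \frac{1}{t} \right] dt \tag{70}$$

$$= \int_0^{\beta} \left( P\left[ P_X(X) > \frac{1}{\beta} \right] - P\left[ P_X(X) \ge \frac{1}{t} \right] \right) dt \tag{71}$$

$$= \beta\, P\left[ P_X(X) > \frac{1}{\beta} \right] - \int_1^{\beta} P\left[ P_X(X) \ge \frac{1}{t} \right] dt \tag{72}$$

$$= \beta\, P[\imath_X(X) < \log_2 \beta] - \int_1^{\beta} P[\imath_X(X) \le \log_2 t]\, dt. \tag{73}$$

■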


While Theorem 6 gives $R^*(\epsilon) = R^*(1, \epsilon)$ exactly for those $\epsilon$ which correspond to values
taken by the complementary cumulative distribution function of the information random
variable $\imath_X(X)$, a continuous sweep of $a > 0$ gives a very dense grid of values, unless
$X$ (whose alphabet size typically grows exponentially with $n$ in the fixed-to-variable setup)
takes values in a very small alphabet. From the value of $a$ we can obtain the probability in the
right side of (64). The optimum code achieves that excess probability $\epsilon = P[\ell(f^*(X)) \ge \ell]$
for lengths $\ell$ equal to,

$$\lceil a + \log_2(2^{-a} + 2^{-a} M(2^a)) \rceil, \tag{74}$$

where the second term is negative and represents the exact gain with respect to the information
spectrum of the source.
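The formula (63)-(64) can be compared against a brute-force search for the smallest $R$ satisfying (39). The sketch below is illustrative only: the six-point distribution, the thresholds $a$ and the function names are assumptions, and the thresholds are chosen so that at least one mass exceeds $2^{-a}$ (otherwise $\epsilon = 1$ and the comparison is vacuous).

```python
from math import ceil, log2

def num_masses_above(pmf, beta):
    """M(beta): number of masses with probability strictly larger than 1/beta."""
    return sum(1 for p in pmf if p > 1.0 / beta)

def rate_from_theorem(pmf, a):
    """The pair (epsilon, R*) given by (64) and (63) for a threshold a."""
    eps = sum(p for p in pmf if -log2(p) >= a)
    return eps, ceil(log2(1 + num_masses_above(pmf, 2.0 ** a))) - 1

def rate_brute_force(pmf, eps):
    """Smallest integer R >= 0 with P[l(f*(X)) > R] <= eps, using l(f*(i)) = floor(log2 i)."""
    ranked = sorted(pmf, reverse=True)
    R = 0
    while sum(ranked[2 ** (R + 1) - 1:]) > eps + 1e-12:   # P[X >= 2^(R+1)]
        R += 1
    return R

# Hypothetical distribution; no threshold below coincides with an information value.
pmf = [0.35, 0.25, 0.2, 0.1, 0.06, 0.04]
results = []
for a in (1.7, 2.1, 2.5, 3.5, 4.2):
    eps, R_formula = rate_from_theorem(pmf, a)
    results.append((R_formula, rate_brute_force(pmf, eps)))
```

In each pair the closed-form value and the brute-force value agree, which is what Theorem 6 asserts for these thresholds.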
For later use we observe that, if we let $M^+(\beta)$ be the number of masses with probability
larger than or equal to $\frac{1}{\beta}$, then,⁵

$$M^+(\beta) = \sum_{x \in \mathcal{X}} 1\left\{ P_X(x) \ge \frac{1}{\beta} \right\} \tag{75}$$

$$= E\left[ \exp(\imath_X(X))\, 1\{ \imath_X(X) \le \log_2 \beta \} \right]. \tag{76}$$

Figure 1 shows the cumulative distribution functions of $\ell(f^*(X))$ and $\imath_X(X)$ when $X$ is a
binomially distributed random variable: the number of tails obtained in 10,000 fair coin flips.
Therefore, $\imath_X(X)$ ranges from $6.97 \approx 10{,}000 - \log_2 \binom{10000}{5000}$ to 10,000 and,

$$H(X) = 7.69 \tag{77}$$

$$E[\ell(f^*(X))] = 6.29, \tag{78}$$

where all figures are in bits.



Fig. 1: Cumulative distribution functions of $\ell(f^*(X))$ and $\imath_X(X)$ when $X$ is the number of tails obtained in 10,000 fair
coin flips.



D. Exact behavior of random binning 

The following result gives an exact expression for the performance of random binning
for arbitrary sources, as a function of the cumulative distribution function of the random
variable $\imath_X(X)$ via (65). In binning, the compressor is no longer constrained to be an injective
mapping. When the label received by the decompressor can be explained by more than one
source realization, it chooses the most likely one, breaking ties arbitrarily. (Cf. [23] for the
exact performance of random coding in channel coding.)

5 Where typographically convenient we use $\exp(a) = 2^a$.
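Theorem 7: Averaging uniformly over all binning compressors $f: \mathcal{X} \to \{1, 2, \ldots, N\}$,
results in an expected error probability equal to,

$$1 - E\left[ \sum_{\ell=0}^{J(X)-1} \frac{1}{\ell + 1} \binom{J(X) - 1}{\ell} \left( \frac{1}{N} \right)^{\ell} \left( 1 - \frac{1}{N} \right)^{M\left(\frac{1}{P_X(X)}\right) + J(X) - \ell - 1} \right], \tag{79}$$

where $M(\cdot)$ is given in (65), and the number of masses whose probability is equal to $P_X(x)$
is denoted by:

$$J(x) = \frac{P[P_X(X) = P_X(x)]}{P_X(x)}. \tag{80}$$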



Proof: For the purposes of the proof, it is convenient to assume that ties are broken
uniformly at random among the most likely source outcomes in the bin. To verify (79), note
that, given that the source realization is $x_0$:

1) The number of masses with probability strictly higher than that of $x_0$ is $M\left(\frac{1}{P_X(x_0)}\right)$;

2) Correct decompression of $x_0$ requires that any $x$ with $P_X(x) > P_X(x_0)$ not be assigned
to the same bin as $x_0$. This occurs with probability:

$$\left( 1 - \frac{1}{N} \right)^{M\left(\frac{1}{P_X(x_0)}\right)}; \tag{81}$$

3) If there are $\ell$ masses with the same probability as $x_0$ in the same bin, correct
decompression occurs with probability $\frac{1}{1 + \ell}$;

4) The probability that there are $\ell$ masses with the same probability as $x_0$ in the same bin
is equal to:

$$\binom{J(x_0) - 1}{\ell} \left( \frac{1}{N} \right)^{\ell} \left( 1 - \frac{1}{N} \right)^{J(x_0) - \ell - 1}. \tag{82}$$

Then, (79) follows since all the bin assignments are independent. ■

Theorem 7 leads to an achievability bound for both almost-lossless fixed-to-fixed
compression and lossless fixed-to-variable compression. However, in view of the simplicity and
tightness of Theorem 2, the main usefulness of Theorem 7 is to gauge the suboptimality
of random binning in the finite (in fact, rather short because of computational complexity)
blocklength regime.
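For very small alphabets, the average error of random binning can be computed by exhaustive enumeration and compared with the expression obtained by combining steps 1) to 4) of the proof of Theorem 7. The sketch below makes several assumptions: the four-point distribution with a tie, the function names, and the reading of the combined formula (a product of the factor in (81) with the average of $1/(1+\ell)$ under (82)) are all illustrative.

```python
from fractions import Fraction
from itertools import product
from math import comb

def binning_error_formula(pmf, N):
    """1 - E[sum over l of C(J-1, l) (1/N)^l (1 - 1/N)^(M + J - l - 1) / (l + 1)]."""
    correct = Fraction(0)
    for px in pmf:
        M = sum(1 for q in pmf if q > px)        # masses strictly more likely than x
        J = sum(1 for q in pmf if q == px)       # masses as likely as x (x included)
        inner = sum(Fraction(comb(J - 1, l), l + 1) * Fraction(1, N) ** l
                    * (1 - Fraction(1, N)) ** (M + J - l - 1) for l in range(J))
        correct += px * inner
    return 1 - correct

def binning_error_brute_force(pmf, N):
    """Average error over all N^|X| bin assignments, ties broken uniformly at random."""
    size = len(pmf)
    correct = Fraction(0)
    for bins in product(range(N), repeat=size):
        for x in range(size):
            same_bin = [y for y in range(size) if bins[y] == bins[x]]
            best = max(pmf[y] for y in same_bin)
            if pmf[x] == best:
                ties = sum(1 for y in same_bin if pmf[y] == best)
                correct += pmf[x] * Fraction(1, ties)
    return 1 - correct / N ** size

# Hypothetical distribution with one tie, so that J(x) > 1 for two outcomes.
pmf = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
checks = [(binning_error_formula(pmf, N), binning_error_brute_force(pmf, N)) for N in (2, 3)]
```

Because exact rational arithmetic is used, agreement between the two columns is an exact identity for this example rather than a numerical coincidence.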
III. Minimal Expected Length

Recall the definition of the best achievable rate $R(n)$ in Section I, expressed in terms of
$f_n^*$ as in (8). An immediate consequence of Theorem 2 is the bound,

$$n R(n) = E[\ell(f^*(X))] \le H(X), \tag{83}$$

which goes back at least to the work of Wyner [32]. Indeed, by lifting the prefix condition it
is possible to beat the entropy on average as we saw in the asymptotic results (30) and (31).
Lower bounds on the minimal average length as a function of H(X) can be found in lETTl and 
references therein. An explicit expression can be obtained easily by labeling the outcomes as
the positive integers with decreasing probabilities as in the proof of Theorem 2:

$$E[\ell(f^*(X))] = E[\lfloor \log_2 X \rfloor] \tag{84}$$

$$= \sum_{k=1}^{\infty} P[\lfloor \log_2 X \rfloor \ge k] \tag{85}$$

$$= \sum_{k=1}^{\infty} P[X \ge 2^k]. \tag{86}$$

Example 2: The average number of bits required to encode at which flip of a fair coin the
first tail appears is equal to,

$$\sum_{k=1}^{\infty} P[X \ge 2^k] = \sum_{k=1}^{\infty} \sum_{j=2^k}^{\infty} 2^{-j} \tag{87}$$

$$= 2 \sum_{k=1}^{\infty} 2^{-2^k} \tag{88}$$

$$\approx 0.632843, \tag{89}$$

since, in this case, $X$ is a geometric random variable with $P[X = j] = 2^{-j}$. In contrast,
imposing a prefix constraint disables any compression: the optimal prefix code consists of
all, possibly empty, strings of 0s terminated by 1, achieving an average length of 2.
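The value in (89) can be reproduced in two ways: by summing $\lfloor \log_2 j \rfloor\, 2^{-j}$ directly, and by the tail-sum form (88). The function names and truncation points in this sketch are illustrative assumptions; both series converge so fast that the truncation error is far below the printed precision.

```python
def expected_length_direct(terms=200):
    """E[floor(log2 X)] for P[X = j] = 2^-j, summed directly over j = 1..terms."""
    return sum((j.bit_length() - 1) * 2.0 ** -j for j in range(1, terms + 1))

def expected_length_tail_sum(terms=10):
    """Same quantity via (88): 2 * sum over k of 2^-(2^k), k = 1..terms."""
    return 2.0 * sum(2.0 ** -(2 ** k) for k in range(1, terms + 1))

direct = expected_length_direct()
tail = expected_length_tail_sum()
```

Both computations give approximately 0.632843 bits, against an average length of 2 bits for the optimal prefix code in the same example.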

Example 3: If $X_M$ is equiprobable on a set of $M$ elements, then:

1)

$$E[\ell(f^*(X_M))] = \lfloor \log_2 M \rfloor + \frac{1}{M} \left( 2 + \lfloor \log_2 M \rfloor - 2^{\lfloor \log_2 M \rfloor + 1} \right), \tag{90}$$

which simplifies to,

$$E[\ell(f^*(X_M))] = \frac{(M + 1) \log_2(M + 1)}{M} - 2, \tag{91}$$

when $M + 1$ is a power of 2.

2)

$$\limsup_{M \to \infty} H(X_M) - E[\ell(f^*(X_M))] = 2 \tag{92}$$

$$\liminf_{M \to \infty} H(X_M) - E[\ell(f^*(X_M))] = 1 + \log_2 e - \log_2 \log_2 e, \tag{93}$$

where the entropy is expressed in bits.
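The closed form (90) can be checked against a direct average of $\lfloor \log_2 i \rfloor$ over $i = 1, \ldots, M$. The function names and the range of $M$ below are illustrative assumptions; exact rational arithmetic avoids any rounding issue.

```python
from fractions import Fraction

def mean_length_direct(M):
    """Average of floor(log2 i) over i = 1..M: the optimal code on M equiprobable outcomes."""
    return Fraction(sum(i.bit_length() - 1 for i in range(1, M + 1)), M)

def mean_length_formula(M):
    """Right side of (90)."""
    m = M.bit_length() - 1                       # floor(log2 M)
    return m + Fraction(2 + m - 2 ** (m + 1), M)

agree = all(mean_length_direct(M) == mean_length_formula(M) for M in range(1, 501))
```

For instance, $M = 3$ gives an average length of $2/3$, which also matches (91) since $M + 1 = 4$ is a power of 2.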
Theorem 8: For any source $\mathbf{X} = \{P_{X^n}\}_{n=1}^{\infty}$ with finite entropy rate,

$$H(\mathbf{X}) = \limsup_{n \to \infty} \frac{1}{n} H(X^n) < \infty, \tag{94}$$

the normalized minimal average length satisfies:

$$\limsup_{n \to \infty} R(n) = H(\mathbf{X}). \tag{95}$$

Proof: The achievability (upper) bound in (95) holds in view of (83). In the reverse
direction, we invoke the bound [1]:

$$H(X^n) - E[\ell(f_n^*(X^n))] \le \log_2(H(X^n) + 1) + \log_2 e. \tag{96}$$

Upon dividing both sides of (96) by $n$ and taking limsup the desired result follows, since
for any $\delta > 0$, for all sufficiently large $n$, $H(X^n) \le n H(\mathbf{X}) + n \delta$. ■

In view of (33), we see that the penalty incurred on the average rate by the prefix condition
vanishes asymptotically in the very wide generality allowed by Theorem 8. In fact, the same
proof we used for Theorem 8 shows the following result:

Theorem 9: For any (not necessarily serial) source $\mathbf{X} = \{P_{X^{(n)}}\}_{n=1}^{\infty}$,

lim M = lim mux**))] 

as long as $H(X^{(n)})$ diverges, where $X^{(n)} \in \mathcal{A}_n$, an alphabet which is not necessarily a
Cartesian product.
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IV. POINTWISE ASYMPTOTICS 

A. Normalized pointwise redundancy 

Before turning to the precise evaluation of the best achievable rate $R^*(n, \epsilon)$, in this section
we examine the asymptotic behavior of the normalized difference between the codelength
and the information (sometimes known as the pointwise redundancy).

Theorem 10: For any discrete source and any divergent deterministic sequence $\kappa_n$ such
that,
$$ \lim_{n\to\infty} \frac{\log n}{\kappa_n} = 0, \tag{98} $$
we have:

(a) For any sequence $\{f_n\}$ of codes:
$$ \liminf_{n\to\infty} \frac{1}{\kappa_n}\left(\ell(f_n(X^n)) - \imath_{X^n}(X^n)\right) \geq 0, \quad \text{w.p.1.} \tag{99} $$

(b) The sequence of optimal codes $\{f_n^*\}$ achieves:
$$ \lim_{n\to\infty} \frac{1}{\kappa_n}\left(\ell(f_n^*(X^n)) - \imath_{X^n}(X^n)\right) = 0, \quad \text{w.p.1.} \tag{100} $$

Proof: (a) We invoke the general converse in Theorem 4 with $X^n$ and $\mathcal{A}^n$ in place of
$X$ and $\mathcal{X}$, respectively. Fixing arbitrary $\epsilon > 0$ and letting $\tau = \tau_n = \epsilon\kappa_n$, we obtain that,
$$ \mathbb{P}\left[\ell(f_n(X^n)) \leq \imath_{X^n}(X^n) - \epsilon\kappa_n\right] \leq 2^{\log_2 n - \epsilon\kappa_n}\left(\log_2 |\mathcal{A}| + 1\right), \tag{101} $$
which is summable in $n$. Therefore, the Borel-Cantelli lemma implies that the lim sup of the
event on the left side of (101) has zero probability, or equivalently, with probability one,
$$ \ell(f_n(X^n)) - \imath_{X^n}(X^n) > -\epsilon\kappa_n $$
is violated only a finite number of times. Since $\epsilon$ can be chosen arbitrarily small, (99) follows.
Part (b) follows from (a) and (46). ■



B. Stationary Ergodic Sources 

Theorem 8 states that for any discrete process $\mathbf{X}$ the expected rate of the optimal codes $f_n^*$
satisfies,
$$ \limsup_{n\to\infty} \frac{1}{n}\mathbb{E}[\ell(f_n^*(X^n))] = H(\mathbf{X}). \tag{102} $$

The next result shows that if the source is stationary and ergodic, then the same asymptotic 
relation holds not just in expectation, but also with probability 1. Moreover, no compressor 
can beat the entropy rate asymptotically with positive probability. The corresponding results 
for prefix codes were established in [2], [10], [11].

Theorem 11: Suppose that $\{X_n\}$ is a stationary ergodic source with entropy rate $H$.

(i) For any sequence $\{f_n\}$ of codes,
$$ \liminf_{n\to\infty} \frac{1}{n}\ell(f_n(X^n)) \geq H, \quad \text{w.p.1.} \tag{103} $$

(ii) The sequence of optimal codes $\{f_n^*\}$ achieves,
$$ \lim_{n\to\infty} \frac{1}{n}\ell(f_n^*(X^n)) = H, \quad \text{w.p.1.} \tag{104} $$

Proof: The Shannon-McMillan-Breiman theorem states that,
$$ \frac{1}{n}\imath_{X^n}(X^n) \to H, \quad \text{w.p.1.} \tag{105} $$
Therefore, the result is an immediate consequence of Theorem 10 with $\kappa_n = n$.



C. Stationary Ergodic Markov Sources 

We assume that the source is a stationary ergodic (first-order) Markov chain, with transition
kernel,
$$ P_{X'|X}(x'|x), \quad (x,x') \in \mathcal{A}^2, \tag{106} $$
on the finite alphabet $\mathcal{A}$. Further restricting the source to be Markov enables us to analyze
more precisely the behavior of the information random variables and, in particular, we will
show that the zero-mean random variables,
$$ Z_n = \frac{\imath_{X^n}(X^n) - H(X^n)}{\sqrt{n}}, \tag{107} $$
are asymptotically normal with variance given by the varentropy rate, which generalizes the
notion in (32).



Definition 1: The varentropy rate of a random process $\mathbf{X} = \{P_{X^n}\}_{n=1}^{\infty}$ is
$$ \sigma^2 = \limsup_{n\to\infty} \frac{1}{n}\mathrm{Var}(\imath_{X^n}(X^n)). \tag{108} $$

Some remarks are in order:

• If $\mathbf{X}$ is a stationary memoryless process each of whose letters is distributed according
to $P_X$, then the varentropy rate of $\mathbf{X}$ is equal to the varentropy of $X$. The varentropy of
$X$ is zero if and only if it is equiprobable on its support.

• In contrast to the first moment, we do not know whether stationarity is sufficient for
lim sup = lim inf in (108).

• While the entropy rate of a Markov chain admits a two-letter expression, the varentropy
rate does not. In particular, if $\sigma^2(a)$ denotes the varentropy of the distribution $P_{X'|X}(\cdot \,|\, a)$,
then the varentropy rate of the chain is, in general, not given by $\mathbb{E}[\sigma^2(X_0)]$.

• The varentropy rate of Markov sources is typically nonzero. For example, for a first
order Markov chain it was observed in [33], [12] that $\sigma^2 = 0$ if and only if the source
satisfies the following deterministic equipartition property: Every string $x^{n+1}$ that starts
and ends with the same symbol, has probability (given that $X_1 = x_1$) $q^n$, for some
constant $0 < q < 1$.
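To make the remarks above concrete, the following sketch evaluates $\mathrm{Var}(\imath_{X^n}(X^n))$ exactly for a small first-order chain by propagating, for each end state, the zeroth, first and second moments of the information. It is an illustration under stated assumptions, not the paper's method: the function names and the two-state example chain are hypothetical.

```python
import math

def information_variance(init, trans, n):
    """Exact Var(i(X^n)) for a first-order Markov chain.

    init[s] is P[X_1 = s]; trans[s][t] is P[X' = t | X = s].
    m0, m1, m2 hold, per end state, the sums over all strings ending there
    of P, P * information and P * information^2.
    """
    size = len(init)
    m0, m1, m2 = [0.0] * size, [0.0] * size, [0.0] * size
    for s, p in enumerate(init):
        if p > 0:
            c = -math.log2(p)
            m0[s], m1[s], m2[s] = p, p * c, p * c * c
    for _ in range(n - 1):
        n0, n1, n2 = [0.0] * size, [0.0] * size, [0.0] * size
        for s in range(size):
            for t in range(size):
                p = trans[s][t]
                if p > 0:
                    c = -math.log2(p)  # information added by the step s -> t
                    n0[t] += p * m0[s]
                    n1[t] += p * (m1[s] + c * m0[s])
                    n2[t] += p * (m2[s] + 2 * c * m1[s] + c * c * m0[s])
        m0, m1, m2 = n0, n1, n2
    mean = sum(m1)
    return sum(m2) - mean * mean

def varentropy(dist):
    """Variance of the information of a single distribution, in bits^2."""
    h = sum(-p * math.log2(p) for p in dist if p > 0)
    return sum(p * (-math.log2(p) - h) ** 2 for p in dist if p > 0)

if __name__ == "__main__":
    # Hypothetical two-state chain started from its stationary distribution.
    pi = [0.8, 0.2]
    P = [[0.9, 0.1], [0.4, 0.6]]
    n = 2000
    print("Var(i(X^n))/n      :", information_variance(pi, P, n) / n)
    print("two-letter E[s^2(X)]:", sum(pi[s] * varentropy(P[s]) for s in range(2)))
```

Comparing the two printed numbers for a chain of one's choice shows how far the two-letter quantity is from the normalized variance; for a memoryless source (identical rows) the two coincide.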

Theorem 12: Let $\{X_n\}$ be a stationary ergodic finite-state Markov chain.

(i) The varentropy rate $\sigma^2$ is also equal to the corresponding liminf of the normalized
variances in (108), and it is finite.

(ii) The normalized information random variables are asymptotically normal, in the sense
that, as $n \to \infty$,
$$ \frac{\imath_{X^n}(X^n) - H(X^n)}{\sqrt{n}} \longrightarrow N(0, \sigma^2), \tag{109} $$
in distribution.

(iii) The normalized information random variables satisfy a corresponding law of the iterated
logarithm:
$$ \limsup_{n\to\infty} \frac{\imath_{X^n}(X^n) - H(X^n)}{\sqrt{2n \ln\ln n}} = \sigma, \quad \text{w.p.1} \tag{110} $$
$$ \liminf_{n\to\infty} \frac{\imath_{X^n}(X^n) - H(X^n)}{\sqrt{2n \ln\ln n}} = -\sigma, \quad \text{w.p.1} \tag{111} $$



Proof:

(i) and (ii): Consider the bivariate Markov chain $\{\tilde{X}_n = (X_n, X_{n+1})\}$ on the alphabet
$\mathcal{B} = \{(x,y) \in \mathcal{A}^2 : P_{X'|X}(y|x) > 0\}$ and the function $f: \mathcal{B} \to \mathbb{R}$ defined by
$$ f(x,y) = \imath_{X'|X}(y|x). \tag{112} $$
Since $\{X_n\}$ is stationary and ergodic, so is $\{\tilde{X}_n\}$, hence, by the central limit theorem for
functions of Markov chains [0 



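$$ \frac{1}{\sqrt{n}} \sum_{i=1}^{n-1} \left( f(\tilde{X}_i) - \mathbb{E}[f(\tilde{X}_i)] \right) \tag{113} $$
converges in distribution to the zero-mean Gaussian law with finite variance
$$ \Sigma^2 = \lim_{n\to\infty} \frac{1}{n}\mathrm{Var}\left(\imath_{X^n|X_1}(X^n|X_1)\right). $$
Furthermore, since
$$ \imath_{X^n}(X^n) - H(X^n) = \sum_{i=1}^{n-1} \left( f(\tilde{X}_i) - \mathbb{E}[f(\tilde{X}_i)] \right) + \left(\imath_{X_1}(X_1) - H(X_1)\right) \tag{114} $$
$$ = \imath_{X^n|X_1}(X^n|X_1) - H(X^n|X_1) + \left(\imath_{X_1}(X_1) - H(X_1)\right), \tag{115} $$
where the second term is bounded, (109) must hold and we must have $\Sigma^2 = \sigma^2$.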

(iii) Normalizing (113) by $\sqrt{2n \ln\ln n}$ in lieu of $\sqrt{n}$, we can invoke the law of the iterated
logarithm for functions of Markov chains jH to show that the lim sup / lim inf of the sum 
behave as claimed. Since upon normalization, the second term in the right side of (115)
vanishes almost surely, $\imath_{X^n}(X^n) - H(X^n)$ must satisfy the same behavior. ■



Together with Theorem 10 particularized to $\kappa_n = \sqrt{n}$, we conclude that the normalized
deviation of the optimal codelengths from the entropy rate satisfies
$$ \frac{\ell(f_n^*(X^n)) - H(X^n)}{\sqrt{n}} \longrightarrow N(0, \sigma^2), \tag{116} $$
in distribution, which is the same behavior as that exhibited by the Shannon prefix code [11], so as far as the
pointwise $\sqrt{n}$ asymptotics are concerned, the prefix constraint does not entail loss of efficiency. Similarly,
the following result readily follows from Theorem 12 and Theorem 10 with $\kappa_n = \sqrt{2n \ln\ln n}$.
Theorem 13: Suppose $\{X_n\}$ is a stationary ergodic Markov chain with entropy rate $H$ and
varentropy rate $\sigma^2$. Then:

(i) For any sequence of codes $\{f_n\}$:
$$ \limsup_{n\to\infty} \frac{\ell(f_n(X^n)) - H(X^n)}{\sqrt{2n \ln\ln n}} \geq \sigma, \quad \text{w.p.1;} \tag{117} $$
$$ \liminf_{n\to\infty} \frac{\ell(f_n(X^n)) - H(X^n)}{\sqrt{2n \ln\ln n}} \geq -\sigma, \quad \text{w.p.1.} \tag{118} $$

(ii) The sequence of optimal codes $\{f_n^*\}$ achieves the bounds in (117) and (118) with equality.



The Markov sufficient condition in Theorem 12 enabled the application of the central limit
theorem and the law of the iterated logarithm to the sum in (113). According to Theorem 9.1
of [22] a more general sufficient condition is that $\{X_n\}$ be a stationary process with $\alpha(d) =
O(d^{-336})$ and $\gamma(d) = O(d^{-48})$, with the mixing coefficients:
$$ \gamma(d) = \max_{a \in \mathcal{A}} \mathbb{E}\left| \imath_{X_0|X_{-\infty}^{-1}}(a \,|\, X_{-1}, X_{-2}, \ldots) - \imath_{X_0|X_{-d}^{-1}}(a \,|\, X_{-1}, X_{-2}, \ldots, X_{-d}) \right|, \tag{119} $$
$$ \alpha(d) = \sup\left\{ |\mathbb{P}(B \cap A) - \mathbb{P}(B)\mathbb{P}(A)| \; ; \; A \in \mathcal{F}_{-\infty}^{0}, \; B \in \mathcal{F}_{d}^{\infty} \right\}. \tag{120} $$
Here $\mathcal{F}_{-\infty}^{0}$ and $\mathcal{F}_{d}^{\infty}$ denote the $\sigma$-algebras generated by the collections of random variables
$(X_0, X_{-1}, \ldots)$ and $(X_d, X_{d+1}, \ldots)$, respectively. The $\alpha(d)$ are the strong mixing coefficients
[3] of $\{X_n\}$, and the $\gamma(d)$ were introduced by Ibragimov in [8]. Although these mixing
conditions may be hard to verify in practice, they are fairly weak in that they require only
polynomial decay of $\alpha(d)$ and $\gamma(d)$. In particular, any ergodic Markov chain of any order
satisfies these conditions.
V. Gaussian Approximation for Memoryless Sources

We turn our attention to the non-asymptotic behavior of the best rate $R^*(n,\epsilon)$ that can
be achieved when compressing a stationary memoryless finite-alphabet source $\{X_n \in \mathcal{A}\}$
with marginal distribution $P_X$, whose entropy and varentropy are denoted by $H$ and $\sigma^2$,
respectively.

Specifically, we will derive explicit upper and lower bounds on $R^*(n,\epsilon)$ in terms of the
first three moments of the information random variable $\imath_X(X)$. Although, particularizing
Theorem 6, it is possible, in principle, to compute $R^*(n,\epsilon)$ exactly, it is more desirable
to derive approximations that are both easier to compute and offer more intuition into the
behavior of the fundamental limit $R^*(n, \epsilon)$.



Theorems 16 and 17 imply that, for all $\epsilon \in (0,1/2)$, the best achievable rate $R^*(n,\epsilon)$
satisfies,
$$ \frac{c}{n} \leq R^*(n,\epsilon) - \left( H + \frac{\sigma}{\sqrt{n}} Q^{-1}(\epsilon) - \frac{\log_2 n}{2n} \right) \leq \frac{c'}{n}. \tag{121} $$
The upper bound is valid for all $n$, the lower bound is valid for $n > n_0$ as in (157), and
explicit values are derived for the constants $c$, $c'$. In view of Theorem 1, essentially the same
results as in (121) hold for prefix codes,
$$ \frac{c}{n} \leq R_p(n,\epsilon) - \left( H + \frac{\sigma}{\sqrt{n}} Q^{-1}(\epsilon) - \frac{\log_2 n}{2n} \right) \leq \frac{c' + 1}{n}. \tag{122} $$
The bounds in equations (121) and (122) justify the Gaussian approximation (37) stated in
Section I.

Before establishing the precise non-asymptotic relations leading to (121) and (122), we
illustrate their utility via an example. To facilitate this, note that Theorem 2 immediately
yields the following simple bound:



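Theorem 14: For all $n \geq 1$, $\epsilon > 0$,
$$ R^*(n,\epsilon) \leq R^u(n,\epsilon), \tag{123} $$
where $R^u(n,\epsilon)$ is the quantile function of the information spectrum, i.e., the lowest $R$ such
that:
$$ \mathbb{P}\left[ \frac{1}{n} \sum_{i=1}^{n} \imath_X(X_i) > R \right] \leq \epsilon. \tag{124} $$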



In Figure 2 we exhibit the behavior of the fundamental compression limit $R^*(n,\epsilon)$ in the
case of coin flips with bias 0.11 (for which $H \approx 0.5$ bits). In particular, we compare $R^*(n, \epsilon)$
and $R^u(n,\epsilon)$ for $\epsilon = 0.1$. The non-monotonic nature of both $R^*(n,\epsilon)$ and $R^u(n,\epsilon)$ with $n$
is not surprising: although the larger the value of $n$ the less we are at the mercy of the
source randomness, we also need to compress more information. Figure 2 also illustrates that
$R^*(n,\epsilon)$ is tracked rather closely by the Gaussian approximation,
$$ \tilde{R}^*(n,\epsilon) = H + Q^{-1}(\epsilon) \frac{\sigma}{\sqrt{n}} - \frac{1}{2n} \log_2 n, \tag{125} $$
suggested by (121).
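For the coin-flip example, both the information-spectrum quantile $R^u(n,\epsilon)$ and the Gaussian approximation can be evaluated in a few lines. The sketch below is illustrative only: it assumes a Bernoulli($p$) source with $p < 1/2$, so that the information of a string increases with its number of ones, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

def entropy_and_varentropy(p):
    """Entropy (bits) and varentropy (bits^2) of a Bernoulli(p) letter."""
    h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    v = p * (1 - p) * math.log2((1 - p) / p) ** 2
    return h, v

def rate_upper_quantile(n, eps, p):
    """Lowest R with P[(1/n) sum_i i_X(X_i) > R] <= eps, assuming p < 1/2."""
    log_p, log_q = math.log(p), math.log(1 - p)
    cdf = 0.0
    for k in range(n + 1):  # k = number of ones in the block
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1)
                   - math.lgamma(n - k + 1) + k * log_p + (n - k) * log_q)
        cdf += math.exp(log_pmf)
        if 1.0 - cdf <= eps:
            return (k * math.log2(1 / p) + (n - k) * math.log2(1 / (1 - p))) / n
    return math.log2(1 / p)

def gaussian_approximation(n, eps, p):
    """H + Q^{-1}(eps) * sigma / sqrt(n) - log2(n) / (2n)."""
    h, v = entropy_and_varentropy(p)
    q_inv = NormalDist().inv_cdf(1 - eps)  # Q^{-1}(eps)
    return h + q_inv * math.sqrt(v / n) - math.log2(n) / (2 * n)

if __name__ == "__main__":
    for n in (200, 500, 1000, 2000):
        print(n, rate_upper_quantile(n, 0.1, 0.11), gaussian_approximation(n, 0.1, 0.11))
```

The binomial probabilities are computed in the log domain so that blocklengths in the thousands do not overflow.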
Fig. 2: The optimum rate $R^*(n, 0.1)$, the Gaussian approximation $\tilde{R}^*(n, 0.1)$ in (125), and the upper bound $R^u(n, 0.1)$
for a Bernoulli-0.11 source and blocklengths $200 \leq n \leq 2000$.

Figure 3 focuses the comparison between $R^*(n, 0.1)$ and $\tilde{R}^*(n, 0.1)$ on the short blocklength
range up to 200 not shown in Figure 2. For $n > 60$, the discrepancy between the two never
exceeds 4%.



The remainder of the section is devoted to justifying the use of (125) as an accurate
approximation to $R^*(n, \epsilon)$. To that end, in Theorems 17 and 16 we establish the bounds
given in (121). Their derivation requires that we overcome two technical hurdles:

1) The distribution function of the optimal encoding length is not the same as the distribution
of $\frac{1}{n} \sum_{i=1}^{n} \imath_X(X_i)$;

2) The distribution of $\frac{1}{n} \sum_{i=1}^{n} \imath_X(X_i)$ is only approximately Gaussian.

To cope with the second hurdle we will appeal to the classical Berry-Esseen bound [15].



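Theorem 15: Let $\{Z_i\}$ be independent and identically distributed random variables with
zero mean and unit variance, and let $\bar{Z}$ be standard normal. Then, for all $n \geq 1$ and any $a$:
$$ \left| \mathbb{P}\left[ \frac{1}{\sqrt{n}} \sum_{i=1}^{n} Z_i \leq a \right] - \mathbb{P}\left[\bar{Z} \leq a\right] \right| \leq \frac{\mathbb{E}[|Z_1|^3]}{2\sqrt{n}}. \tag{126} $$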
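A quick numerical sanity check of a Berry-Esseen bound of this type is possible for normalized Bernoulli sums. The sketch below computes the exact Kolmogorov distance to the standard normal and compares it with $\mathbb{E}[|Z_1|^3]/(2\sqrt{n})$; the constant $1/2$ is taken as an assumption here (it matches the $\mu_3/(2\sigma^3\sqrt{n})$ terms used in the proofs that follow), and the function names are illustrative.

```python
import math
from statistics import NormalDist

def kolmogorov_distance(n, p):
    """sup_a |P[(1/sqrt n) sum Z_i <= a] - Phi(a)| for Z_i = (B_i - p)/sqrt(p(1-p)).

    The distribution function of the sum is a step function, so the supremum
    is attained at a jump, either at the jump itself or just to its left.
    """
    std_normal = NormalDist()
    sd = math.sqrt(n * p * (1 - p))
    log_p, log_q = math.log(p), math.log(1 - p)
    cdf, worst = 0.0, 0.0
    for k in range(n + 1):
        a = (k - n * p) / sd                      # location of the k-th jump
        gauss = std_normal.cdf(a)
        worst = max(worst, abs(cdf - gauss))      # left limit at the jump
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1)
                   - math.lgamma(n - k + 1) + k * log_p + (n - k) * log_q)
        cdf += math.exp(log_pmf)
        worst = max(worst, abs(cdf - gauss))      # value at the jump
    return worst

def berry_esseen_bound(n, p):
    """E|Z_1|^3 / (2 sqrt n) for the standardized Bernoulli(p) variable."""
    sd1 = math.sqrt(p * (1 - p))
    third = (p * (1 - p) ** 3 + (1 - p) * p ** 3) / sd1 ** 3
    return third / (2 * math.sqrt(n))

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, kolmogorov_distance(n, 0.11), berry_esseen_bound(n, 0.11))
```

Both columns decay like $1/\sqrt{n}$, which is why the bound cannot be improved in order for lattice distributions.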



Fig. 3: The optimum rate $R^*(n, 0.1)$ and the Gaussian approximation $\tilde{R}^*(n, 0.1)$ in (125), for a Bernoulli-0.11 source and
blocklengths $10 \leq n \leq 200$.

Invoking the Berry-Esseen bound, Strassen [26] claimed the following approximation for
$n > \frac{19600}{\delta^{16}}$,
$$ \left| R^*(n,\epsilon) - \tilde{R}^*(n,\epsilon) \right| \leq \frac{140}{n\delta^8}, \tag{127} $$
where
$$ \delta \leq \min\left\{ \sigma, \epsilon, 1 - \epsilon, \mu_3^{-1/3} \right\}, \tag{128} $$
$$ \mu_3 = \mathbb{E}\left[ |\imath_X(X) - H|^3 \right]. \tag{129} $$
Unfortunately, we were not able to verify how [26] justifies the application of (126) to bound
integrals with respect to the corresponding cumulative distribution functions (cf. equations
(2.17), (3.18) and the displayed equation between (3.15) and (3.16) in [26]).
The following achievability result holds for all blocklengths. 

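Theorem 16: For all $0 < \epsilon \leq \frac{1}{2}$ and all $n \geq 1$,
$$ R^*(n,\epsilon) \leq H + \frac{\sigma}{\sqrt{n}} Q^{-1}(\epsilon) - \frac{\log_2 n}{2n} + \frac{1}{n}\log_2\left( \frac{\log_2 e}{\sqrt{2\pi\sigma^2}} + \frac{\mu_3}{\sigma^3} \right) + \frac{1}{n} \, \frac{\mu_3}{2\sigma^2 \, \phi\left( \Phi^{-1}\left( \Phi(Q^{-1}(\epsilon)) + \frac{\mu_3}{2\sigma^3\sqrt{n}} \right) \right)}, \tag{130} $$
as long as the varentropy $\sigma^2$ is strictly positive, where $\Phi = 1 - Q$ and $\phi = \Phi'$ are the standard
Gaussian distribution function and density, respectively.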
Proof: The proof follows Strassen's construction, but the essential approximation steps
are different. The positive constant $\beta_n$ is uniquely defined by:
$$ \mathbb{P}\left[\imath_{X^n}(X^n) \leq \log_2 \beta_n\right] \geq 1 - \epsilon, \tag{131} $$
$$ \mathbb{P}\left[\imath_{X^n}(X^n) < \log_2 \beta_n\right] < 1 - \epsilon. \tag{132} $$
Since the information spectrum (i.e., the distribution function of the information random
variable $\imath_{X^n}(X^n)$) is piecewise constant, $\log_2 \beta_n$ is the location of the jump where the
information spectrum reaches (or exceeds for the first time) the value $1 - \epsilon$. Furthermore,
defining the normalized constant,
$$ \lambda_n = \frac{\log_2 \beta_n - nH}{\sqrt{n}\,\sigma}, \tag{133} $$
the probability in the left side of (131) is,
$$ \mathbb{P}\left[ \frac{\imath_{X^n}(X^n) - nH}{\sqrt{n}\,\sigma} \leq \lambda_n \right] \leq \Phi(\lambda_n) + \frac{\mu_3}{2\sigma^3\sqrt{n}}, \tag{134} $$
where we have applied Theorem 15. Analogously, we obtain,
$$ \mathbb{P}\left[ \frac{\imath_{X^n}(X^n) - nH}{\sqrt{n}\,\sigma} < \lambda_n \right] \geq \Phi(\lambda_n) - \frac{\mu_3}{2\sigma^3\sqrt{n}}. \tag{135} $$



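Since $1 - \epsilon$ is sandwiched between the right sides of (134) and (135), as $n \to \infty$ we must
have that, $\lambda_n \to \lambda$, where,
$$ \lambda = \Phi^{-1}(1 - \epsilon) = Q^{-1}(\epsilon). \tag{136} $$
By a simple first-order Taylor bound,
$$ \lambda_n \leq \Phi^{-1}\left( \Phi(\lambda) + \frac{\mu_3}{2\sigma^3\sqrt{n}} \right) \tag{137} $$
$$ = \lambda + \frac{\mu_3}{2\sigma^3\sqrt{n}} \left(\Phi^{-1}\right)'(\xi_n) \tag{138} $$
$$ = \lambda + \frac{\mu_3}{2\sigma^3\sqrt{n}} \, \frac{1}{\phi(\Phi^{-1}(\xi_n))}, \tag{139} $$
for some $\xi_n \in \left[\Phi(\lambda), \Phi(\lambda) + \frac{\mu_3}{2\sigma^3\sqrt{n}}\right]$. Since $\epsilon \leq 1/2$, we have $\lambda \geq 0$ and $\Phi(\lambda) \geq 1/2$, so that
$\xi_n \geq 1/2$. And since $\Phi^{-1}(t)$ is strictly increasing for all $t$, while $\phi(t)$ is strictly decreasing for
$t > 0$, from (139) we obtain,
$$ \lambda_n \leq \lambda + \frac{\mu_3}{2\sigma^3\sqrt{n}} \, \frac{1}{\phi\left( \Phi^{-1}\left( \Phi(\lambda) + \frac{\mu_3}{2\sigma^3\sqrt{n}} \right) \right)}. \tag{140} $$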



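The event $E_n$ in the left side of (131) contains all the "high probability strings," and it
has probability $\geq 1 - \epsilon$. Its cardinality is $M_X^+(\beta_n)$, defined in (75) (with $X \leftarrow X^n$). Therefore,
denoting,
$$ p(t) = 2^{-t}\, 1\{t \geq 0\}, \tag{141} $$
$$ Y_i = \frac{1}{\sigma}\left(\imath_X(X_i) - H\right), \tag{142} $$
we obtain,
$$ R^*(n,\epsilon) \leq \frac{1}{n}\log_2 M_{X^n}^+(\beta_n) \tag{143} $$
$$ = \frac{1}{n}\log_2 \mathbb{E}\left[ \exp\left(\imath_{X^n}(X^n)\right) 1\{\imath_{X^n}(X^n) \leq \log_2 \beta_n\} \right] \tag{144} $$
$$ = H + \lambda_n \frac{\sigma}{\sqrt{n}} + \frac{1}{n}\log_2 \alpha_n, \tag{145} $$
with,
$$ \alpha_n = \mathbb{E}\left[ p\left(\log_2 \beta_n - \imath_{X^n}(X^n)\right) \right] \tag{146} $$
$$ = \mathbb{E}\left[ p\left( \sqrt{n}\,\sigma \left( \lambda_n - \frac{1}{\sqrt{n}} \sum_{i=1}^{n} Y_i \right) \right) \right], \tag{147} $$
and where the $\{Y_i\}$ are independent, identically distributed, with zero mean and unit variance.
Let $\tilde{\alpha}_n$ be defined as (147) except that $Y_i$ are replaced by $\bar{Y}_i$ which are standard normal. Then,
straightforward algebra yields,
$$ \tilde{\alpha}_n = \mathbb{E}\left[ 2^{-\sqrt{n}\sigma(\lambda_n - \bar{Y}_1)} 1\{\bar{Y}_1 \leq \lambda_n\} \right] \tag{148} $$
$$ = \frac{1}{\sqrt{n}\,\sigma} \int_0^{\infty} \phi\left( \lambda_n - \frac{x}{\sqrt{n}\,\sigma} \right) 2^{-x}\, dx \tag{149} $$
$$ \leq \frac{\log_2 e}{\sqrt{2\pi n}\,\sigma}. \tag{150} $$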



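To deal with the fact that the random variables in (147) are not normal, we apply the Lebesgue-Stieltjes
integration by parts formula to (147). Denoting the distribution of the normalized
sum in (147) by $F_n(t)$, $\alpha_n$ becomes,
$$ \alpha_n = \int_{-\infty}^{\lambda_n} 2^{-\sqrt{n}\sigma(\lambda_n - t)}\, dF_n(t) \tag{151} $$
$$ = F_n(\lambda_n) - \int_{-\infty}^{\lambda_n} F_n(t) \sqrt{n}\,\sigma\, 2^{-\sqrt{n}\sigma(\lambda_n - t)}\, dt \, \log_e 2 \tag{152} $$
$$ = \tilde{\alpha}_n + F_n(\lambda_n) - \Phi(\lambda_n) - \int_{-\infty}^{\lambda_n} \left(F_n(t) - \Phi(t)\right) \sqrt{n}\,\sigma\, 2^{-\sqrt{n}\sigma(\lambda_n - t)}\, dt \, \log_e 2 \tag{153} $$
$$ \leq \tilde{\alpha}_n + \frac{\mu_3}{2\sigma^3\sqrt{n}} + \frac{\mu_3}{2\sigma^2} \int_{-\infty}^{\lambda_n} 2^{-\sqrt{n}\sigma(\lambda_n - t)}\, dt \, \log_e 2 \tag{154} $$
$$ = \tilde{\alpha}_n + \frac{\mu_3}{\sigma^3\sqrt{n}} \tag{155} $$
$$ \leq \frac{1}{\sqrt{n}} \left( \frac{\log_2 e}{\sqrt{2\pi\sigma^2}} + \frac{\mu_3}{\sigma^3} \right), \tag{156} $$
where (154) follows from Theorem 15. The desired result now follows from (145) after
assembling the bounds on $\lambda_n$ and $\alpha_n$ in (140) and (156), respectively.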
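Next we give a complementary converse result.

Theorem 17: For all $0 < \epsilon < \frac{1}{2}$ and all $n$ such that,
$$ n > n_0 = \frac{1}{4}\left( 1 + \frac{\mu_3}{2\sigma^3} \right)^2 \frac{1}{\left( \phi(Q^{-1}(\epsilon))\, Q^{-1}(\epsilon) \right)^2}, \tag{157} $$
the following lower bound holds,
$$ R^*(n,\epsilon) \geq H + \frac{\sigma}{\sqrt{n}} Q^{-1}(\epsilon) - \frac{\log_2 n}{2n} - \frac{1}{n} \, \frac{\frac{\mu_3}{2} + \sigma^3}{\sigma^2 \phi(Q^{-1}(\epsilon))}, \tag{158} $$
as long as the varentropy $\sigma^2$ is strictly positive.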
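Proof: Let,
$$ \eta = \frac{\frac{\mu_3}{2} + \sigma^3}{\sigma^2 \phi(Q^{-1}(\epsilon))} = \frac{\frac{\mu_3}{2\sigma^2} + \sigma}{\phi(Q^{-1}(\epsilon))}, \tag{159} $$
and consider,
$$ \mathbb{P}\left[ \sum_{i=1}^{n} \imath_X(X_i) \geq Hn + \sigma\sqrt{n}\, Q^{-1}(\epsilon) - \eta \right] = \mathbb{P}\left[ \frac{1}{\sigma\sqrt{n}} \sum_{i=1}^{n} \left(\imath_X(X_i) - H\right) \geq Q^{-1}(\epsilon) - \frac{\eta}{\sigma\sqrt{n}} \right] \tag{160} $$
$$ \geq Q\left( Q^{-1}(\epsilon) - \frac{\eta}{\sigma\sqrt{n}} \right) - \frac{\mu_3}{2\sigma^3\sqrt{n}} \tag{161} $$
$$ \geq \epsilon + \frac{\eta}{\sigma\sqrt{n}} \phi(Q^{-1}(\epsilon)) - \frac{\mu_3}{2\sigma^3\sqrt{n}} \tag{162} $$
$$ = \epsilon + \frac{1}{\sqrt{n}}, \tag{163} $$
where (161) follows from Theorem 15, and (162) follows from,
$$ Q(a - \Delta) \geq Q(a) + \Delta\phi(a), \tag{164} $$
which holds at least as long as,
$$ a > \frac{\Delta}{2} > 0. \tag{165} $$
Letting $a = Q^{-1}(\epsilon)$ and $\Delta = \frac{\eta}{\sigma\sqrt{n}}$, (165) is equivalent to (157).

We proceed to invoke Theorem 3 with $X \leftarrow X^n$, $k$ equal to $n$ times the right side of (158),
and $\tau = \frac{1}{2}\log_2 n$. In view of the definition of $R^*(n,\epsilon)$ and (160)-(163), the desired result
follows. ■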
VI. Gaussian Approximation for Markov Sources

Let $\{X_n\}$ be an irreducible, aperiodic, $k$th order Markov chain on the finite alphabet $\mathcal{A}$,
with transition probabilities,
$$ P_{X'|X^k}(x_{k+1} \,|\, x^k), \quad x^{k+1} \in \mathcal{A}^{k+1}, \tag{166} $$
and entropy rate $H$. Note that we do not assume that the source is stationary. In Theorem 12
of Section IV we established that the varentropy rate defined in general in equation (108)
for stationary ergodic chains exists as the limit,
$$ \sigma^2 = \lim_{n\to\infty} \frac{1}{n}\mathrm{Var}(\imath_{X^n}(X^n)). \tag{167} $$



An examination of the proof shows that, by an application of the general central limit 
theorem for (uniformly ergodic) Markov chains |0, [fl9l . the assumption of stationarity is 



not necessary, and (167) holds for all irreducible aperiodic chains. 

Theorem 18: Suppose $\{X_n\}$ is an irreducible and aperiodic $k$th order Markov source, and
let $\epsilon \in (0, 1/2)$. Then, there is a positive constant $C$ such that, for all $n$ large enough,
$$ nR^*(n, \epsilon) \leq nH + \sigma\sqrt{n}\, Q^{-1}(\epsilon) + C, \tag{168} $$
where the varentropy rate $\sigma^2$ is given by (167) and it is assumed to be strictly positive.

Theorem 19: Under the same assumptions as in Theorem 18, for all $n$ large enough,
$$ nR^*(n,\epsilon) \geq nH + \sigma\sqrt{n}\, Q^{-1}(\epsilon) - \frac{1}{2}\log_2 n - C, \tag{169} $$
where $C > 0$ is a finite constant, possibly different from that in Theorem 18.

Remarks.

1) By definition, the lower bound in Theorem 19 also applies to $R_p(n,\epsilon)$, while in view of
Theorem 1, the upper bound in Theorem 18 also applies to $R_p(n,\epsilon)$ provided $C$ is
replaced by $C + 1$.

2) Note that, unlike the direct and converse coding theorems for memoryless sources
(Theorems 16 and 17, respectively) the results of Theorems 18 and 19 are asymptotic
in that we do not give explicit bounds for the constant terms. This is because the main
probabilistic tool we use in the proofs (the Berry-Esseen bound in Theorem 15) does
not have an equally precise counterpart for Markov chains. Specifically, in the proof of
Theorem 20 below we appeal to a Berry-Esseen bound established by Nagaev in [20],
which does not give an explicit value for the multiplicative constant $A$ in (174). More
explicit bounds do exist, but they require additional conditions on the Markov chain;
see, e.g., Mann's thesis [16], and the references therein.

3) If we restrict our attention to the (much more narrow) class of reversible chains, then
it is indeed possible to apply the Berry-Esseen bound of Mann [16] to obtain explicit
values for the constants in Theorems 18 and 19, but the resulting values are pretty loose,
drastically limiting the engineering usefulness of the resulting bounds. For example, in
Mann's version of the Berry-Esseen bound, the corresponding right side of the inequality
as in Theorem 15 is multiplied by a factor of 13000. Therefore, we have opted for the
less explicit but much more general statements given above.

4) Similar comments to those in the last two remarks apply to the observation that
Theorem 18 is a weaker bound than that established in Theorem 16 for memoryless sources,
by a $(1/2) \log_2 n$ term. Instead of restricting our result to the much more narrow class
of reversible chains, or extending the involved proof of Theorem 16 to the case of
a Markovian source, we chose to illustrate how this slightly weaker bound can be
established in full generality, with a much shorter and simpler proof.



5) The proof of Theorem 18 shows that the constant in its statement can be chosen as
$$ C = \frac{2A\sigma}{\phi(Q^{-1}(\epsilon))}, \tag{170} $$
for all
$$ n \geq \frac{2A^2}{\pi e \, \phi(Q^{-1}(\epsilon))^4}, \tag{171} $$
where $A$ is the constant appearing in Theorem 20 below. Similarly, from the proof of
Theorem 19 we see that the constant in its statement can be chosen as
$$ C = \frac{\sigma(A + 1)}{\phi(Q^{-1}(\epsilon))} + 1, \tag{172} $$
for all,
$$ n \geq \left( \frac{A + 1}{Q^{-1}(\epsilon)\, \phi(Q^{-1}(\epsilon))} \right)^2. \tag{173} $$
Note that, in both cases, the values of the constants can easily be improved, but they
still depend on the implicit constant $A$ of Theorem 20.



As mentioned above, we will need a Berry-Esseen-type bound on the scaled information
random variables,
$$ \frac{\imath_{X^n}(X^n) - nH}{\sqrt{n}\,\sigma}. $$
Beyond the Shannon-McMillan-Breiman theorem, several more refined asymptotic results
have been established for this sequence; see, in particular, [8], [22], [26], [33] and the
discussions in [12] and in Section IV. Unlike these asymptotic results, we will use the
following non-asymptotic bound.

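Theorem 20: For an ergodic, $k$th order Markov source $\{X_n\}$ with entropy rate $H$ and
positive varentropy rate $\sigma^2$, there exists a finite constant $A > 0$ such that, for all $n \geq 1$,
$$ \sup_{z \in \mathbb{R}} \left| \mathbb{P}\left[ \frac{\imath_{X^n}(X^n) - nH}{\sqrt{n}\,\sigma} > z \right] - Q(z) \right| \leq \frac{A}{\sqrt{n}}. \tag{174} $$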



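Proof: For integers $i \leq j$, we adopt the notation $x_i^j$ and $X_i^j$ for blocks of strings
$(x_i, x_{i+1}, \ldots, x_j)$ and random variables $(X_i, X_{i+1}, \ldots, X_j)$, respectively. For all $x^{n+k} \in \mathcal{A}^{n+k}$
such that $P_{X^k}(x^k) > 0$ and $P_{X'|X^k}(x_j \,|\, x_{j-k}^{j-1}) > 0$, for $j = k + 1, k + 2, \ldots, n + k$, we have,
$$ \imath_{X^n}(x^n) = \log_2 \frac{1}{P_{X^k}(x^k) \prod_{j=k+1}^{n} P_{X'|X^k}(x_j \,|\, x_{j-k}^{j-1})} \tag{175} $$
$$ = \sum_{j=k+1}^{k+n} \log_2 \frac{1}{P_{X'|X^k}(x_j \,|\, x_{j-k}^{j-1})} + \log_2 \frac{\prod_{j=n+1}^{n+k} P_{X'|X^k}(x_j \,|\, x_{j-k}^{j-1})}{P_{X^k}(x^k)} \tag{176} $$
$$ = \sum_{j=1}^{n} f(x_j^{j+k}) + \Delta_n, \tag{177} $$
where the function $f : \mathcal{A}^{k+1} \to \mathbb{R}$ is defined by,
$$ f(x^{k+1}) = \imath_{X'|X^k}(x_{k+1} \,|\, x^k) = \log_2 \frac{1}{P_{X'|X^k}(x_{k+1} \,|\, x^k)}, \tag{178} $$
and,
$$ \Delta_n = \log_2 \frac{\prod_{j=n+1}^{n+k} P_{X'|X^k}(x_j \,|\, x_{j-k}^{j-1})}{P_{X^k}(x^k)}. \tag{179} $$
Denote
$$ |\Delta_n| \leq \delta = \max \left| \log_2 \frac{\prod_{j=n+1}^{n+k} P_{X'|X^k}(x_j \,|\, x_{j-k}^{j-1})}{P_{X^k}(x^k)} \right| < \infty, \tag{180} $$
where the maximum is over the positive probability strings for which we have established
(177).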

Let $\{Y_n\}$ denote the first-order Markov source defined by taking overlapping $(k+1)$-blocks
in the original chain,
$$ Y_n = (X_n, X_{n+1}, \ldots, X_{n+k}). \tag{181} $$
Since $\{X_n\}$ is irreducible and aperiodic, so is $\{Y_n\}$ on the state space,
$$ \mathcal{A}' = \left\{ x^{k+1} \in \mathcal{A}^{k+1} : P_{X'|X^k}(x_{k+1} \,|\, x^k) > 0 \right\}. \tag{182} $$
Now, since the chain $\{Y_n\}$ is irreducible and aperiodic on a finite state space, condition (0.2)
of [20] is satisfied, and since the function $f$ is bounded, Theorem 1 of [20] implies that there
exists a finite constant $A_1$ such that, for all $n$,
$$ \sup_{z \in \mathbb{R}} \left| \mathbb{P}\left[ \frac{\sum_{j=1}^{n} f(Y_j) - nH}{\sigma\sqrt{n}} > z \right] - Q(z) \right| \leq \frac{A_1}{\sqrt{n}}, \tag{183} $$
where the entropy rate is $H = \mathbb{E}[f(\tilde{Y}_1)]$ and,
$$ \Sigma^2 = \lim_{n\to\infty} \frac{1}{n} \mathbb{E}\left[ \left( \sum_{j=1}^{n} \left( f(\tilde{Y}_j) - H \right) \right)^2 \right], \tag{184} $$
where $\{\tilde{Y}_n\}$ is a stationary version of $\{Y_n\}$, that is, it has the same transition probabilities
but its initial distribution is its unique invariant distribution,
$$ \mathbb{P}[\tilde{Y}_1 = x^{k+1}] = \pi(x^k)\, P_{X'|X^k}(x_{k+1} \,|\, x^k), \tag{185} $$
where $\pi$ is the unique invariant distribution of the original chain $\{X_n\}$. Since the function
$f$ is bounded and the distribution of the chain $\{Y_n\}$ converges to stationarity exponentially
fast, it is easy to see that (184) coincides with the source varentropy rate.

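Let $F_n(z)$, $G_n(z)$ denote the complementary cumulative distribution functions,
$$ F_n(z) = \mathbb{P}\left[ \imath_{X^n}(X^n) - nH > z\sqrt{n}\,\sigma \right], \tag{186} $$
$$ G_n(z) = \mathbb{P}\left[ \sum_{j=1}^{n} f(Y_j) - nH > z\sqrt{n}\,\sigma \right]. \tag{187} $$
Since $F_n(z)$ and $G_n(z)$ are non-increasing, (177) and (180) imply that
$$ F_n(z) \geq G_n\left(z + \frac{\delta}{\sigma\sqrt{n}}\right) \tag{188} $$
$$ \geq Q\left(z + \frac{\delta}{\sigma\sqrt{n}}\right) - \frac{A_1}{\sqrt{n}} \tag{189} $$
$$ \geq Q(z) - \frac{A}{\sqrt{n}}, \tag{190} $$
uniformly in $z$, where (189) follows from (183), and (190) holds with $A = A_1 + \frac{\delta}{\sigma\sqrt{2\pi}}$ since
$Q'(z) = -\phi(z)$ is bounded below by $-1/\sqrt{2\pi}$. A similar argument shows that,
$$ F_n(z) \leq G_n\left(z - \frac{\delta}{\sigma\sqrt{n}}\right) \tag{191} $$
$$ \leq Q\left(z - \frac{\delta}{\sigma\sqrt{n}}\right) + \frac{A_1}{\sqrt{n}} \tag{192} $$
$$ \leq Q(z) + \frac{A}{\sqrt{n}}. \tag{193} $$
Since both (190) and (193) hold uniformly in $z \in \mathbb{R}$, together they form the statement of the
theorem. ■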



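Proof of Theorem 18: Starting from Theorem 2 with $X^n$ in place of $X$ and with,
$$ K_n = nH + \sigma\sqrt{n}\, Q^{-1}(\epsilon) + C, \tag{194} $$
where $C$ will be chosen below, Theorem 2 states that,
$$ \mathbb{P}\left[\ell(f_n^*(X^n)) \geq K_n\right] \leq \mathbb{P}\left[\imath_{X^n}(X^n) \geq K_n\right] \tag{195} $$
$$ = \mathbb{P}\left[ \frac{\imath_{X^n}(X^n) - nH}{\sigma\sqrt{n}} \geq Q^{-1}(\epsilon) + \frac{C}{\sigma\sqrt{n}} \right] \tag{196} $$
$$ \leq Q\left( Q^{-1}(\epsilon) + \frac{C}{\sigma\sqrt{n}} \right) + \frac{A}{\sqrt{n}}, \tag{197} $$
where (197) follows from Theorem 20. Since,
$$ Q'(x) = -\phi(x), \tag{198} $$
$$ 0 \leq Q''(x) = x\phi(x) \leq \frac{1}{\sqrt{2\pi e}}, \quad x \geq 0, \tag{199} $$
a second-order Taylor expansion of the first term in the right side of (197) gives,
$$ \mathbb{P}\left[\ell(f_n^*(X^n)) \geq K_n\right] \leq \epsilon - \frac{C}{\sigma\sqrt{n}}\phi(Q^{-1}(\epsilon)) + \frac{1}{2\sqrt{2\pi e}}\left( \frac{C}{\sigma\sqrt{n}} \right)^2 + \frac{A}{\sqrt{n}} \tag{200} $$
$$ = \epsilon - \frac{1}{\sigma\sqrt{n}}\left[ C\phi(Q^{-1}(\epsilon)) - \frac{1}{2\sqrt{2\pi e}}\, \frac{C^2}{\sigma\sqrt{n}} - A\sigma \right], \tag{201} $$
and choosing $C$ as in (170) for $n$ satisfying (171) the right side of (201) is bounded above
by $\epsilon$. Therefore, $\mathbb{P}[\ell(f_n^*(X^n)) \geq K_n] \leq \epsilon$, which, by definition implies that $nR^*(n,\epsilon) \leq K_n$,
as claimed. ■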

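Proof of Theorem 19: Applying Theorem 3 with $X^n$ in place of $X$ and with $\delta > 0$ and
$K_n \geq 1$ arbitrary, we obtain,
$$ \mathbb{P}\left[\ell(f_n^*(X^n)) \geq K_n\right] \geq \mathbb{P}\left[\imath_{X^n}(X^n) \geq K_n + \delta\right] - 2^{-\delta} \tag{202} $$
$$ = \mathbb{P}\left[ \frac{\imath_{X^n}(X^n) - nH}{\sigma\sqrt{n}} \geq \frac{K_n - nH + \delta}{\sigma\sqrt{n}} \right] - 2^{-\delta} \tag{203} $$
$$ \geq Q\left( \frac{K_n - nH + \delta}{\sigma\sqrt{n}} \right) - \frac{A}{\sqrt{n}} - 2^{-\delta}, \tag{204} $$
where (204) now follows from Theorem 20. Letting $\delta = \delta_n = \frac{1}{2}\log_2 n$ and,
$$ K_n = nH + \sigma\sqrt{n}\, Q^{-1}(\epsilon) - \delta - \frac{\sigma(A+1)}{\phi(Q^{-1}(\epsilon))}, \tag{205} $$
yields,
$$ \mathbb{P}\left[\ell(f_n^*(X^n)) \geq K_n\right] \geq Q\left( Q^{-1}(\epsilon) - \frac{A+1}{\phi(Q^{-1}(\epsilon))\sqrt{n}} \right) - \frac{A+1}{\sqrt{n}}. \tag{206} $$
Note that, since $\epsilon \in (0,1/2)$, we have $Q^{-1}(\epsilon) > 0$. And since $Q'(x) = -\phi(x)$, a simple
two-term Taylor expansion of $Q$ above gives,
$$ \mathbb{P}\left[\ell(f_n^*(X^n)) \geq K_n\right] > \epsilon + \frac{A+1}{\sqrt{n}} - \frac{A+1}{\sqrt{n}} = \epsilon, \tag{207} $$
for all,
$$ n \geq \left( \frac{A+1}{Q^{-1}(\epsilon)\, \phi(Q^{-1}(\epsilon))} \right)^2, $$
hence $nR^*(n, \epsilon) > K_n - 1$, as claimed.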
VII. Source Dispersion and Varentropy

Traditionally, refined analyses in lossless data compression have focused attention on the
redundancy, defined as the difference between the minimum average compression rate and
the entropy rate. As we mentioned in Section I-D, if the source statistics are known, then
the per-symbol redundancy is positive and behaves as $O\left(\frac{1}{n}\right)$ when the prefix condition is
enforced, while it is $-\frac{1}{2n}\log_2 n + O\left(\frac{1}{n}\right)$ without the prefix condition. But since, as we saw in
Sections V and VI, the standard deviation of the best achievable compression rate is $O\left(\frac{1}{\sqrt{n}}\right)$,
the rate will be dominated by these fluctuations. Therefore, as noted in [11], it is of primary
importance to analyze the variance of the optimal codelengths. To that end, we introduce the
following operational definition:

Definition 2: The dispersion $D$ (measured in bits$^2$) of a source $\{P_{X^n}\}_{n=1}^{\infty}$ is,
$$ D = \limsup_{n\to\infty} \frac{1}{n}\mathrm{Var}(\ell(f_n^*(X^n))), \tag{208} $$
where $\ell(f_n^*(\cdot))$ is the length of the optimum fixed-to-variable lossless code (cf. Section I-A).



As we show in Theorem 22 below, for a broad class of sources, the dispersion $D$ is
equal to the source varentropy rate $\sigma^2$ defined in (108). Moreover, in view of the Gaussian
approximation bounds for $R^*(n, \epsilon)$ in Sections V and VI - and more generally, as long as a
similar two-term Gaussian approximation in terms of the entropy rate and varentropy rate can
be established up to $o(1/\sqrt{n})$ accuracy - we can conclude the following: by the definition of
$n^*(R, \epsilon)$ in Section I-A, the source blocklength $n$ required for the compression rate to exceed
$(1 + \eta)H$ with probability no greater than $\epsilon > 0$ is approximated by,
$$ n^*((1 + \eta)H, \epsilon) \approx \frac{\sigma^2}{H^2}\left( \frac{Q^{-1}(\epsilon)}{\eta} \right)^2 \tag{209} $$
$$ = \frac{D}{H^2}\left( \frac{Q^{-1}(\epsilon)}{\eta} \right)^2, \tag{210} $$
i.e., by the product of a factor that depends only on the source (through $H$ and $D$ or $\sigma^2$),
and a factor that depends only on the design requirements $\epsilon$ and $\eta$. Note that this is in close
parallel with the notion of channel dispersion introduced in [23].

Example 4: Coin flips with bias $p$ have varentropy,
$$ \sigma^2 = p(1-p)\log_2^2 \frac{1-p}{p}, \tag{211} $$
so the key parameter in (210) which characterizes the time horizon required for the source
to become "typical" is,
$$ \frac{D}{H^2} = \frac{p - p^2}{\left( p + \left( \frac{\log p}{\log(1-p)} - 1 \right)^{-1} \right)^2}. \tag{212} $$
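As a worked instance of the blocklength approximation, the sketch below evaluates $\frac{\sigma^2}{H^2}\left(\frac{Q^{-1}(\epsilon)}{\eta}\right)^2$ for a biased coin and cross-checks the ratio $\sigma^2/H^2$ against the closed form $(p - p^2)/\left(p + (\log p/\log(1-p) - 1)^{-1}\right)^2$, which follows from the coin's entropy and varentropy by elementary algebra. Function names and the chosen values of $\epsilon$ and $\eta$ are illustrative assumptions.

```python
import math
from statistics import NormalDist

def coin_entropy_varentropy(p):
    """Entropy (bits) and varentropy (bits^2) of a coin with bias p."""
    h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    v = p * (1 - p) * math.log2((1 - p) / p) ** 2
    return h, v

def normalized_dispersion_closed_form(p):
    """(p - p^2) / (p + 1 / (log p / log(1-p) - 1))^2, for 0 < p < 1/2."""
    return (p - p * p) / (p + 1.0 / (math.log(p) / math.log(1 - p) - 1.0)) ** 2

def blocklength_estimate(h, v, eps, eta):
    """Approximate blocklength for rate (1 + eta) * H at excess probability eps."""
    q_inv = NormalDist().inv_cdf(1 - eps)  # Q^{-1}(eps)
    return (v / h ** 2) * (q_inv / eta) ** 2

if __name__ == "__main__":
    h, v = coin_entropy_varentropy(0.11)
    print("sigma^2 / H^2:", v / h ** 2, normalized_dispersion_closed_form(0.11))
    print("n* for eps = 0.1, eta = 0.1:", blocklength_estimate(h, v, 0.1, 0.1))
```

The estimate splits into a source factor and a requirements factor, so halving the rate margin $\eta$ quadruples the estimated blocklength.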

Example 5: For a memoryless source whose marginal is the geometric distribution,
$$ P_X(k) = q(1-q)^k, \quad k \geq 0, \tag{213} $$
the ratio of varentropy to squared entropy is,
$$ \frac{\sigma^2}{H^2} = \frac{D}{H^2} = (1-q)\left( \frac{\log_2(1-q)}{h(q)} \right)^2, \tag{214} $$
where $h$ denotes the binary entropy function.




Fig. 4: Normalized dispersion as a function of entropy for memoryless sources 



Figure 4 compares the normalized dispersion to the entropy for the Bernoulli, geometric and
Poisson distributions. We see that the more compressible the source (the lower its entropy
per letter), the longer the horizon over which we need to compress in order to squeeze most
of the redundancy out of the source.

Definition 3: A source $\{X_n\}$ taking values on the finite alphabet $\mathcal{A}$ is a linear information
growth source if any nonzero-probability string has probability bounded below by an
exponential, that is, if there is a finite constant $A$ and an integer $N_0 \geq 1$ such that, for all
$n \geq N_0$, every nonzero-probability string $x^n \in \mathcal{A}^n$ satisfies
$$ \imath_{X^n}(x^n) \leq An. \tag{215} $$

Any memoryless source belongs to the class of linear information growth. Also note that
every irreducible and aperiodic Markov chain is a linear information growth source: Writing
$q$ for the smallest nonzero element of the transition matrix, and $\pi$ for the smallest nonzero
probability for $X_1$, we easily see that (215) is satisfied with $N_0 = 1$, $A = \log_2(1/(q\pi))$. The
class of linear information growth sources is related, at least at the level of intuition, to the
class of finite-energy processes considered by Shields [25] and to processes satisfying the
Doeblin-like condition of Kontoyiannis and Suhov iTPiTl . 

We proceed to show an interesting regularity result for linear information growth sources: 

Lemma 1: Suppose $\{X_n\}$ is a (not necessarily stationary or ergodic) linear information
growth source. Then:
$$ \lim_{n\to\infty} \frac{1}{n}\mathbb{E}\left[ \left( \ell(f_n^*(X^n)) - \imath_{X^n}(X^n) \right)^2 \right] = 0. \tag{216} $$



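Proof: For brevity, denote $\ell_n = \ell(f_n^*(X^n))$ and $\imath_n = \imath_{X^n}(X^n)$, respectively. Select an
arbitrary $\tau_n$. The expectation of interest is
$$ \mathbb{E}[(\ell_n - \imath_n)^2] = \mathbb{E}[(\ell_n - \imath_n)^2\, 1\{\ell_n \geq \imath_n - \tau_n\}] + \mathbb{E}[(\ell_n - \imath_n)^2\, 1\{\ell_n < \imath_n - \tau_n\}]. \tag{217} $$
Since $\ell_n \leq \imath_n$, on the event $\{\ell_n \geq \imath_n - \tau_n\}$, we have $(\ell_n - \imath_n)^2 \leq \tau_n^2$. Also, by the linear
information growth assumption we have the bound $0 \leq \imath_n - \ell_n \leq \imath_n \leq Cn$ for a finite
constant $C$ and all $n$ large enough. Combining these two observations with Theorem 4 we
obtain that,
$$ \mathbb{E}[(\ell_n - \imath_n)^2] \leq \tau_n^2 + C^2 n^2\, \mathbb{P}\{\ell_n < \imath_n - \tau_n\} \tag{218} $$
$$ \leq \tau_n^2 + C^2 n^2\, 2^{-\tau_n} \left( n \log_2 |\mathcal{A}| + 1 \right) \tag{219} $$
$$ \leq \tau_n^2 + C' n^3\, 2^{-\tau_n}, \tag{220} $$
for some $C' < \infty$ and all $n$ large enough. Taking $\tau_n = 3 \log_2 n$, dividing by $n$ and letting
$n \to \infty$ gives the claimed result. ■

Note that we have actually proved a stronger result, namely,
$$ \mathbb{E}\left[ \left( \ell(f_n^*(X^n)) - \imath_{X^n}(X^n) \right)^2 \right] = O(\log_2^2 n). \tag{221} $$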

Linear information growth is sufficient for dispersion to equal varentropy: 

Theorem 21: If the source has linear information growth, and finite varentropy rate, then:
$$ D = \sigma^2. \tag{222} $$



Proof: For notational convenience, we abbreviate H n for H(X n ). Expanding the defini- 
tion of the variance of £ n , we obtain, 



Var(£(f:(X"))) 
= E[(4-E[4]) 2 ] 



E 
E 



(4 - In) + {in ~ Hn) ~ E[4 - Z n ])) 



2n 



- lr. 



+ E[(i n - H n ) 



E 



-i n ]+2E{(£ n -i n )(i n -H n )] 



(223) 
(224) 
(225) 
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and therefore, using the Cauchy-Schwarz inequality twice, 

|Var(£(f:(X")))-Var(z^(X"))| 

= |E[(4 - i n f] - E 2 [£ n - i n ] + 2E[(4 - i n ){i n - H n ) 



< 2E\(L - v 



2{E[(£ n -t n ) 2 ]} 1/2 [Var(z X n(X n ))] 



n\\ll/2 



(226) 
(227) 



Dividing by $n$ and letting $n \to \infty$, we obtain that the first term tends to zero by Lemma 1 and the second term becomes,

$$2\left[\frac{1}{n} E\big[(\ell_n - \imath_n)^2\big]\right]^{1/2} \left[\frac{1}{n} \mathrm{Var}(\imath_{X^n}(X^n))\right]^{1/2}, \tag{228}$$

which also tends to zero by Lemma 1 and the finite-varentropy rate assumption. Therefore,

$$\lim_{n\to\infty} \frac{1}{n} \big|\mathrm{Var}(\ell(f_n^*(X^n))) - \mathrm{Var}(\imath_{X^n}(X^n))\big| = 0, \tag{229}$$

which, in particular, implies that $\sigma^2 = D$.
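The decomposition (225) and the bound (227) hold for every finite $n$, so they can be checked by brute force on a small example. The sketch below does this for a two-state Markov chain; the chain parameters, the blocklength, and the helper names `markov_string_probs` and `variance_terms` are hypothetical choices made only for illustration, and the code again assumes that the $k$-th most probable string receives codelength $\lfloor \log_2 k \rfloor$.

```python
from itertools import product
from math import log2, sqrt


def markov_string_probs(n, init, trans):
    """Probabilities of all binary strings of length n under a 2-state chain."""
    probs = []
    for x in product((0, 1), repeat=n):
        p = init[x[0]]
        for a, b in zip(x, x[1:]):
            p *= trans[a][b]
        probs.append(p)
    return probs


def variance_terms(probs):
    """Return Var(l), Var(i), the right side of (225), and the bound (227)."""
    ranked = sorted(probs, reverse=True)            # most probable first
    lengths = [k.bit_length() - 1 for k in range(1, len(ranked) + 1)]
    infos = [-log2(p) for p in ranked]              # information in bits

    def mean(values):
        return sum(p * v for p, v in zip(ranked, values))

    entropy = mean(infos)
    mean_len = mean(lengths)
    var_len = mean([(l - mean_len) ** 2 for l in lengths])
    var_info = mean([(i - entropy) ** 2 for i in infos])
    gap_sq = mean([(l - i) ** 2 for l, i in zip(lengths, infos)])
    gap_mean = mean([l - i for l, i in zip(lengths, infos)])
    cross = mean([(l - i) * (i - entropy) for l, i in zip(lengths, infos)])
    rhs_225 = gap_sq + var_info - gap_mean ** 2 + 2 * cross
    bound_227 = 2 * gap_sq + 2 * sqrt(gap_sq) * sqrt(var_info)
    return var_len, var_info, rhs_225, bound_227


if __name__ == "__main__":
    # Illustrative chain: uniform start, strictly positive transitions.
    probs = markov_string_probs(10, (0.5, 0.5), ((0.9, 0.1), (0.3, 0.7)))
    print(variance_terms(probs))
```

In such a run the first and third returned values should agree up to floating-point error, and the gap between the two variances should not exceed the fourth.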



In view of (221), if we normalize by $\sqrt{n} \log n$ instead of $n$ in the last step of the proof of Theorem 21, we obtain the stronger result:

$$\big|\mathrm{Var}(\ell(f_n^*(X^n))) - \mathrm{Var}(\imath_{X^n}(X^n))\big| = O\big(\sqrt{n}\, \log_2 n\big). \tag{230}$$

Also, Lemma 1 and Theorem 21 remain valid if instead of the linear information growth condition we invoke the weaker assumption that there exists a sequence $\epsilon_n = o(\sqrt{n})$ such that,

$$\max_{x^n:\; P_{X^n}(x^n) \neq 0} \imath_{X^n}(x^n) = o\big(2^{\epsilon_n/2}\big). \tag{231}$$

We now turn our attention to the Markov chain case.

Theorem 22: Let $\{X_n\}$ be an irreducible, aperiodic (not necessarily stationary) Markov source with entropy rate $H$. Then:

1) The varentropy rate $\sigma^2$ defined in (108) exists as the limit,

$$\sigma^2 = \lim_{n\to\infty} \frac{1}{n} \mathrm{Var}(\imath_{X^n}(X^n)). \tag{232}$$

2) The dispersion $D$ defined in (208) exists as the limit,

$$D = \lim_{n\to\infty} \frac{1}{n} \mathrm{Var}(\ell(f_n^*(X^n))). \tag{233}$$

3) $D = \sigma^2$.

4) The varentropy rate (or, equivalently, the dispersion) can be characterized in terms of the best achievable rate $R^*(n,\epsilon)$ as,

$$\sigma^2 = \lim_{\epsilon \to 0} \lim_{n\to\infty} \frac{n\,(R^*(n,\epsilon) - H)^2}{2 \ln \frac{1}{\epsilon}} = \lim_{\epsilon \to 0} \lim_{n\to\infty} n \left(\frac{R^*(n,\epsilon) - H}{Q^{-1}(\epsilon)}\right)^2, \tag{234}$$

as long as $\sigma^2$ is nonzero.
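The two expressions in (234) are linked through the asymptotic relation $(Q^{-1}(\epsilon))^2 \sim 2 \ln \frac{1}{\epsilon}$ as $\epsilon \downarrow 0$. A short numerical sketch of how slowly this ratio approaches 1 is given below. It relies on `statistics.NormalDist` from the Python standard library; the helper names `q_inv` and `tail_ratio` and the sample values of $\epsilon$ are illustrative assumptions.

```python
from math import log
from statistics import NormalDist


def q_inv(eps):
    """Inverse of the Gaussian tail function Q, via Q^{-1}(eps) = -Phi^{-1}(eps)."""
    return -NormalDist().inv_cdf(eps)


def tail_ratio(eps):
    """Ratio (Q^{-1}(eps))^2 / (2 ln(1/eps)), which tends to 1 as eps -> 0."""
    return q_inv(eps) ** 2 / (2.0 * log(1.0 / eps))


if __name__ == "__main__":
    for eps in (1e-2, 1e-6, 1e-12, 1e-50):
        print(eps, tail_ratio(eps))
```

The ratio stays below 1 and increases only slowly, which suggests that for moderate $\epsilon$ the second expression in (234) is the more accurate of the two.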
Proof: The limiting expression in part 1) was already established in Theorem 12 of Section IV; see also the discussion leading to (167) in Section VI. Recalling that every irreducible and aperiodic Markov source is a linear information growth source, combining part 1) with Theorem 21 immediately yields the results of parts 2) and 3).

Finally, part 4) follows from the results of Section VI. Under the present assumptions, Theorems 18 and 19 together imply that there is a finite constant $C_1$ such that,

$$\left|\sqrt{n}\,(R^*(n,\epsilon) - H) - \sigma Q^{-1}(\epsilon)\right| \le \frac{\log_2 n}{2\sqrt{n}} + \frac{C_1}{\sqrt{n}}, \tag{235}$$

for all $\epsilon \in (0, 1/2)$ and all $n$ large enough. Therefore,

$$\lim_{n\to\infty} n\,(R^*(n,\epsilon) - H)^2 = \sigma^2 \big(Q^{-1}(\epsilon)\big)^2. \tag{236}$$

Dividing by $2 \ln \frac{1}{\epsilon}$, letting $\epsilon \downarrow 0$, and recalling the simple fact that $(Q^{-1}(\epsilon))^2 \sim 2 \ln \frac{1}{\epsilon}$ (see, e.g., [28, Section 3.3]) proves (234) and completes the proof of the theorem. ■

From Theorem 21 it follows that, for a broad class of sources including all ergodic Markov chains with nonzero varentropy rate,

$$\lim_{n\to\infty} \frac{\mathrm{Var}(\ell(f_n^*(X^n)))}{\mathrm{Var}(\imath_{X^n}(X^n))} = 1. \tag{237}$$

Analogously to Theorem 9, we could explore whether (237) might hold under broader conditions, including the general setting of possibly non-serial sources. However, consider the following simple example.



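Example 6: As in Example 3, let $X_M$ be equiprobable on a set of $M$ elements. Then,

$$\begin{aligned}
H(X_M) &= \log_2 M, && (238)\\
\mathrm{Var}(\imath_{X_M}(X_M)) &= 0, && (239)\\
\limsup_{M\to\infty} \mathrm{Var}(\ell(f^*(X_M))) &= 2 + \frac{1}{4}, && (240)\\
\liminf_{M\to\infty} \mathrm{Var}(\ell(f^*(X_M))) &= 2. && (241)
\end{aligned}$$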



To verify (240) and (241), define the function,

$$\begin{aligned}
s(K) &= \sum_{i=1}^{K} i^2\, 2^i && (242)\\
&= -6 + 2^{K+1}\,(3 - 2K + K^2). && (243)
\end{aligned}$$

It is straightforward to check that,

$$E\big[\ell^2(f^*(X_M))\big] = \frac{1}{M}\Big(s(\lfloor \log_2 M \rfloor) - \lfloor \log_2 M \rfloor^2 \cdot \big(2^{\lfloor \log_2 M \rfloor + 1} - M - 1\big)\Big). \tag{244}$$

Together with (90), (244) results in,

$$\mathrm{Var}(\ell(f^*(X_M))) = 3\xi_M - \xi_M^2 + o(1) \tag{245}$$

with,

$$\xi_M = \frac{2^{1 + \lfloor \log_2 M \rfloor}}{M}, \tag{246}$$

which takes values in $(1, 2]$. On that interval, the parabola $3x - x^2$ takes a minimum value of 2 and a maximum value of $(3/2)^2$, and (240), (241) follow.
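The computations in Example 6 are easy to confirm numerically. The following Python sketch checks the closed form (243), evaluates the exact codelength variance for an equiprobable source, and compares it with $3\xi_M - \xi_M^2$. It assumes that the $k$-th element receives codelength $\lfloor \log_2 k \rfloor$, so that length $i$ is used by $2^i$ elements for $i < \lfloor \log_2 M \rfloor$; all function names are illustrative.

```python
from fractions import Fraction


def s_direct(K):
    """The sum (242), evaluated term by term."""
    return sum(i * i * 2 ** i for i in range(1, K + 1))


def s_closed(K):
    """The closed form (243)."""
    return -6 + 2 ** (K + 1) * (3 - 2 * K + K * K)


def length_moments(M):
    """Exact E[l] and E[l^2] when X_M is equiprobable on M elements."""
    L = M.bit_length() - 1                      # floor(log2 M)
    # Length i < L is shared by 2^i elements; length L by the remaining ones.
    counts = [2 ** i for i in range(L)] + [M - 2 ** L + 1]
    m1 = Fraction(sum(i * c for i, c in enumerate(counts)), M)
    m2 = Fraction(sum(i * i * c for i, c in enumerate(counts)), M)
    return m1, m2


def codelength_variance(M):
    m1, m2 = length_moments(M)
    return float(m2 - m1 * m1)


def xi(M):
    """The quantity (246): 2^(1 + floor(log2 M)) / M, which lies in (1, 2]."""
    return 2 ** M.bit_length() / M


if __name__ == "__main__":
    for M in (2 ** 16, 3 * 2 ** 15, 10 ** 6):
        x = xi(M)
        print(M, codelength_variance(M), 3 * x - x * x)
```

For large $M$ the two printed columns should nearly coincide and stay between 2 and $9/4$, up to the $o(1)$ term in (245).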

Although the ratio of optimal codelength variance to the varentropy rate may be infinite, as illustrated in Example 6, we do have the following counterpart of the first-moment result in Theorem 9 for the second moments:

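Theorem 23: For any (not necessarily serial) source $\mathbf{X} = \{P_{X^{(n)}}\}_{n=1}^{\infty}$,

$$\lim_{n\to\infty} \frac{E\big[\ell^2(f_n^*(X^{(n)}))\big]}{E\big[\imath^2_{X^{(n)}}(X^{(n)})\big]} = 1, \tag{247}$$

as long as the denominator diverges.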
Proof: Theorem 2 implies that,

$$E\big[\ell^2(f_n^*(X^{(n)}))\big] \le E\big[\imath^2_{X^{(n)}}(X^{(n)})\big]. \tag{248}$$

Therefore, the lim sup in (247) is bounded above by 1. To establish the corresponding lower bound, fix an arbitrary $\vartheta > 0$. Then,



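$$\begin{aligned}
E\big[\ell^2(f_n^*(X^{(n)}))\big] &= \sum_{k \ge 1} P\big[\ell^2(f_n^*(X^{(n)})) \ge k\big] && (249)\\
&= \sum_{k \ge 1} P\big[\ell(f_n^*(X^{(n)})) \ge \sqrt{k}\big] && (250)\\
&= \sum_{k \ge 1} P\big[\ell(f_n^*(X^{(n)})) \ge \lceil \sqrt{k} \rceil\big] && (251)\\
&\ge \sum_{k \ge 1} \Big[P\big[\imath_{X^{(n)}}(X^{(n)}) \ge (1+\vartheta) \lceil \sqrt{k} \rceil\big] - 2^{-\vartheta \lceil \sqrt{k} \rceil}\Big], && (252)
\end{aligned}$$

where (252) follows by letting $\tau = \vartheta \lceil \sqrt{k} \rceil$ in the converse Theorem 3. Therefore,

$$\begin{aligned}
E\big[\ell^2(f_n^*(X^{(n)}))\big] &\ge -C_\vartheta + \sum_{k \ge 1} P\big[\imath^2_{X^{(n)}}(X^{(n)}) \ge (1+\vartheta)^2 \lceil \sqrt{k} \rceil^2\big] && (253)\\
&\ge -C'_\vartheta + \sum_{k \ge 1} P\big[\imath^2_{X^{(n)}}(X^{(n)}) \ge (1+\vartheta)^3 k\big] && (254)\\
&\ge -C'_\vartheta - 1 + \frac{E\big[\imath^2_{X^{(n)}}(X^{(n)})\big]}{(1+\vartheta)^3}, && (255)
\end{aligned}$$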

where $C_\vartheta$, $C'_\vartheta$ are positive scalars that only vary with $\vartheta$. Note that (253) holds because $a^{\sqrt{k}}$ is summable for all $0 < a < 1$; (254) holds because $(1+\vartheta) k \ge \lceil \sqrt{k} \rceil^2$ for all sufficiently large $k$; and (255) holds because,

$$\int_k^{k+1} (1 - F(x))\, dx \ge 1 - F(k+1), \tag{256}$$

whenever $F(x)$ is a cumulative distribution function. Dividing both sides of (253)-(255) by the second moment $E[\imath^2_{X^{(n)}}(X^{(n)})]$ and letting $n \to \infty$, we conclude that the ratio in (247) is lower bounded by $(1+\vartheta)^{-3}$. Since $\vartheta$ can be taken to be arbitrarily small, this proves that the liminf in (247) is lower bounded by 1, as required. ■
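Theorem 23 can be illustrated on the equiprobable source of Example 6, for which the variance ratio is infinite but the second-moment ratio in (247) still tends to 1. The sketch below uses the closed forms (243) and (244) together with $E[\imath^2_{X_M}(X_M)] = \log_2^2 M$; the function name `second_moment_ratio` and the chosen values of $M$ are illustrative assumptions.

```python
from fractions import Fraction
from math import log2


def second_moment_ratio(M):
    """E[l^2(f*(X_M))] / E[i^2(X_M)] for X_M equiprobable on M >= 2 elements."""
    L = M.bit_length() - 1                                   # floor(log2 M)
    s = -6 + 2 ** (L + 1) * (3 - 2 * L + L * L)              # closed form (243)
    m2_len = Fraction(s - L * L * (2 ** (L + 1) - M - 1), M)  # formula (244)
    m2_info = log2(M) ** 2                                   # every string has i = log2 M
    return float(m2_len) / m2_info


if __name__ == "__main__":
    for exponent in (10, 40, 200, 1000):
        print(exponent, second_moment_ratio(2 ** exponent))
```

The ratio stays below 1, consistent with (248), and approaches 1 only as $\log_2 M$ becomes large, because the gap between the two second moments grows linearly in $\log_2 M$ while the moments themselves grow quadratically.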